2014-06-24 20:51:38

by Kees Cook

Subject: [PATCH v8 0/9] seccomp: add thread sync ability

This adds the ability for threads to request seccomp filter
synchronization across their thread group (at filter attach time).
This lets, for example, Chrome ensure that graphics driver threads are
fully confined after seccomp filters have been attached.

To support this, locking of seccomp changes via the thread-group-shared
sighand lock is introduced, along with a refactoring of no_new_privs. Races
with thread creation are handled via delayed duplication of the seccomp
task struct field.

This includes a new syscall (instead of adding a new prctl option),
as suggested by Andy Lutomirski and Michael Kerrisk.
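
The intended userspace flow can be sketched as follows. This is a
hypothetical example, not code from the series: install_tsync_filter is
an illustrative name, the raw syscall(2) form is used because no libc
wrapper exists yet, and the fallback syscall number assumes x86-64.

```c
#include <assert.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif
#ifndef SECCOMP_SET_MODE_FILTER
#define SECCOMP_SET_MODE_FILTER 1
#endif
#ifndef SECCOMP_FILTER_FLAG_TSYNC
#define SECCOMP_FILTER_FLAG_TSYNC 1
#endif
#ifndef __NR_seccomp
#define __NR_seccomp 317	/* x86-64; arch-specific */
#endif

/* Install a trivial allow-all filter and ask the kernel to
 * synchronize it across every thread in the thread group. */
int install_tsync_filter(void)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	/* Required so an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;

	/* Returns 0, or the TID of a thread that blocked the sync. */
	return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
		       SECCOMP_FILTER_FLAG_TSYNC, &prog);
}
```

From a single-threaded caller the synchronization trivially succeeds and
the call returns 0.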

Thanks!

-Kees

v8:
- drop use of tasklist_lock, appears redundant against sighand (oleg)
- reduced use of smp_load_acquire to logical minimum (oleg)
- change nnp to a task struct held atomic flags field (oleg, luto)
- drop needless irqflags changes in fork.c for holding sighand lock (oleg)
- cleaned up use of thread for-each loop (oleg)
- rearranged patch order to keep syscall changes adjacent
- added example code to manpage (mtk)
v7:
- rebase on Linus's tree (merged with network bpf changes)
- wrote manpage text documenting API (follows this series)
v6:
- switch from seccomp-specific lock to thread-group lock to gain atomicity
- implement seccomp syscall across all architectures with seccomp filter
- clean up sparse warnings around locking
v5:
- move includes around (drysdale)
- drop set_nnp return value (luto)
- use smp_load_acquire/store_release (luto)
- merge nnp changes to seccomp always, fewer ifdef (luto)
v4:
- cleaned up locking further, as noticed by David Drysdale
v3:
- added SECCOMP_EXT_ACT_FILTER for new filter install options
v2:
- reworked to avoid clone races


2014-06-24 20:48:59

by Kees Cook

Subject: [PATCH v8 7/9] ARM: add seccomp syscall

Wires up the new seccomp syscall.

Signed-off-by: Kees Cook <[email protected]>
---
arch/arm/include/uapi/asm/unistd.h | 1 +
arch/arm/kernel/calls.S | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index ba94446c72d9..e21b4a069701 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -409,6 +409,7 @@
#define __NR_sched_setattr (__NR_SYSCALL_BASE+380)
#define __NR_sched_getattr (__NR_SYSCALL_BASE+381)
#define __NR_renameat2 (__NR_SYSCALL_BASE+382)
+#define __NR_seccomp (__NR_SYSCALL_BASE+383)

/*
* This may need to be greater than __NR_last_syscall+1 in order to
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 8f51bdcdacbb..bea85f97f363 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -392,6 +392,7 @@
/* 380 */ CALL(sys_sched_setattr)
CALL(sys_sched_getattr)
CALL(sys_renameat2)
+ CALL(sys_seccomp)
#ifndef syscalls_counted
.equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
#define syscalls_counted
--
1.7.9.5

2014-06-24 20:49:13

by Kees Cook

Subject: [PATCH v8 3/9] seccomp: introduce writer locking

Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to
the seccomp fields are now protected by the sighand spinlock (which
is unique to the thread group). Read access remains lockless because
pointer updates themselves are atomic. Writes (and cloning), however,
often entail additional checks (like maximum instruction counts) that
require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left with stale seccomp state, seccomp duplication is
additionally performed with the sighand lock held. Parent and child are
then certain to have the same seccomp state when they leave the lock.

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook <[email protected]>
---
include/linux/seccomp.h | 6 +++---
kernel/fork.c | 34 +++++++++++++++++++++++++++++++++-
kernel/seccomp.c | 18 +++++++++++++-----
3 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4054b0994071..9ff98b4bfe2e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -14,11 +14,11 @@ struct seccomp_filter;
*
* @mode: indicates one of the valid values above for controlled
* system calls available to a process.
- * @filter: The metadata and ruleset for determining what system calls
- * are allowed for a task.
+ * @filter: must always point to a valid seccomp-filter or NULL as it is
+ * accessed without locking during system call entry.
*
* @filter must only be accessed from the context of current as there
- * is no locking.
+ * is no read locking.
*/
struct seccomp {
int mode;
diff --git a/kernel/fork.c b/kernel/fork.c
index d2799d1fc952..d41bdc8700b6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -315,6 +315,15 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
goto free_ti;

tsk->stack = ti;
+#ifdef CONFIG_SECCOMP
+ /*
+ * We must handle setting up seccomp filters once we're under
+ * the sighand lock in case orig has changed between now and
+ * then. Until then, filter must be NULL to avoid messing up
+ * the usage counts on the error path calling free_task.
+ */
+ tsk->seccomp.filter = NULL;
+#endif

setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
@@ -1081,6 +1090,24 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
return 0;
}

+static void copy_seccomp(struct task_struct *p)
+{
+#ifdef CONFIG_SECCOMP
+ /*
+ * Must be called with sighand->lock held, which is common to
+ * all threads in the group. Regardless, nothing special is
+ * needed for the child since it is not yet in the tasklist.
+ */
+ BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+ get_seccomp_filter(current);
+ p->seccomp = current->seccomp;
+
+ if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
+ set_tsk_thread_flag(p, TIF_SECCOMP);
+#endif
+}
+
SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
{
current->clear_child_tid = tidptr;
@@ -1196,7 +1223,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;

ftrace_graph_init_task(p);
- get_seccomp_filter(p);

rt_mutex_init_task(p);

@@ -1437,6 +1463,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
spin_lock(&current->sighand->siglock);

/*
+ * Copy seccomp details explicitly here, in case they were changed
+ * before holding sighand lock.
+ */
+ copy_seccomp(p);
+
+ /*
* Process group and session signals need to be delivered to just the
* parent before the fork or both the parent and the child after the
* fork. Restart if a signal comes in before we add the new process to
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index edc8c79ed16d..405eb72dfe35 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -172,12 +172,12 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
*/
static u32 seccomp_run_filters(int syscall)
{
- struct seccomp_filter *f;
+ struct seccomp_filter *f = smp_load_acquire(&current->seccomp.filter);
struct seccomp_data sd;
u32 ret = SECCOMP_RET_ALLOW;

/* Ensure unexpected behavior doesn't result in failing open. */
- if (WARN_ON(current->seccomp.filter == NULL))
+ if (unlikely(WARN_ON(f == NULL)))
return SECCOMP_RET_KILL;

populate_seccomp_data(&sd);
@@ -186,7 +186,7 @@ static u32 seccomp_run_filters(int syscall)
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/
- for (f = current->seccomp.filter; f; f = f->prev) {
+ for (; f; f = f->prev) {
u32 cur_ret = SK_RUN_FILTER(f->prog, (void *)&sd);

if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
@@ -312,6 +312,8 @@ out:
* seccomp_attach_filter: validate and attach filter
* @filter: seccomp filter to add to the current process
*
+ * Caller must be holding current->sighand->siglock lock.
+ *
* Returns 0 on success, -ve on error.
*/
static long seccomp_attach_filter(struct seccomp_filter *filter)
@@ -319,6 +321,8 @@ static long seccomp_attach_filter(struct seccomp_filter *filter)
unsigned long total_insns;
struct seccomp_filter *walker;

+ BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
/* Validate resulting filter length. */
total_insns = filter->prog->len;
for (walker = current->seccomp.filter; walker; walker = walker->prev)
@@ -331,7 +335,7 @@ static long seccomp_attach_filter(struct seccomp_filter *filter)
* task reference.
*/
filter->prev = current->seccomp.filter;
- current->seccomp.filter = filter;
+ smp_store_release(&current->seccomp.filter, filter);

return 0;
}
@@ -357,7 +361,7 @@ static inline void seccomp_filter_free(struct seccomp_filter *filter)
/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
void put_seccomp_filter(struct task_struct *tsk)
{
- struct seccomp_filter *orig = tsk->seccomp.filter;
+ struct seccomp_filter *orig = smp_load_acquire(&tsk->seccomp.filter);
/* Clean up single-reference branches iteratively. */
while (orig && atomic_dec_and_test(&orig->usage)) {
struct seccomp_filter *freeme = orig;
@@ -513,6 +517,7 @@ long prctl_get_seccomp(void)
static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
{
struct seccomp_filter *prepared = NULL;
+ unsigned long irqflags;
long ret = -EINVAL;

#ifdef CONFIG_SECCOMP_FILTER
@@ -524,6 +529,8 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
}
#endif

+ spin_lock_irqsave(&current->sighand->siglock, irqflags);
+
if (current->seccomp.mode &&
current->seccomp.mode != seccomp_mode)
goto out;
@@ -551,6 +558,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
current->seccomp.mode = seccomp_mode;
set_thread_flag(TIF_SECCOMP);
out:
+ spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
seccomp_filter_free(prepared);
return ret;
}
--
1.7.9.5

2014-06-24 20:49:11

by Kees Cook

Subject: [PATCH v8 5/9] seccomp: split mode set routines

Extracts the common check/assign logic, and separates the two mode
setting paths to make things more readable with fewer #ifdefs within
function bodies.

Signed-off-by: Kees Cook <[email protected]>
---
kernel/seccomp.c | 123 +++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 84 insertions(+), 39 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index eb8248ab045e..c81000dd0a53 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -194,7 +194,29 @@ static u32 seccomp_run_filters(int syscall)
}
return ret;
}
+#endif /* CONFIG_SECCOMP_FILTER */

+static inline bool seccomp_check_mode(struct task_struct *task,
+ unsigned long seccomp_mode)
+{
+ BUG_ON(!spin_is_locked(&task->sighand->siglock));
+
+ if (task->seccomp.mode && task->seccomp.mode != seccomp_mode)
+ return false;
+
+ return true;
+}
+
+static inline void seccomp_assign_mode(struct task_struct *task,
+ unsigned long seccomp_mode)
+{
+ BUG_ON(!spin_is_locked(&task->sighand->siglock));
+
+ task->seccomp.mode = seccomp_mode;
+ set_tsk_thread_flag(task, TIF_SECCOMP);
+}
+
+#ifdef CONFIG_SECCOMP_FILTER
/**
* seccomp_prepare_filter: Prepares a seccomp filter for use.
* @fprog: BPF program to install
@@ -501,67 +523,83 @@ long prctl_get_seccomp(void)
}

/**
- * seccomp_set_mode: internal function for setting seccomp mode
- * @seccomp_mode: requested mode to use
- * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
+ * seccomp_set_mode_strict: internal function for setting strict seccomp
*
- * This function may be called repeatedly with a @seccomp_mode of
- * SECCOMP_MODE_FILTER to install additional filters. Every filter
- * successfully installed will be evaluated (in reverse order) for each system
- * call the task makes.
+ * Once current->seccomp.mode is non-zero, it may not be changed.
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+static long seccomp_set_mode_strict(void)
+{
+ const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
+ unsigned long irqflags;
+ int ret = -EINVAL;
+
+ spin_lock_irqsave(&current->sighand->siglock, irqflags);
+
+ if (!seccomp_check_mode(current, seccomp_mode))
+ goto out;
+
+#ifdef TIF_NOTSC
+ disable_TSC();
+#endif
+ seccomp_assign_mode(current, seccomp_mode);
+ ret = 0;
+
+out:
+ spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
+
+ return ret;
+}
+
+#ifdef CONFIG_SECCOMP_FILTER
+/**
+ * seccomp_set_mode_filter: internal function for setting seccomp filter
+ * @filter: struct sock_fprog containing filter
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the task makes.
*
* Once current->seccomp.mode is non-zero, it may not be changed.
*
* Returns 0 on success or -EINVAL on failure.
*/
-static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
+static long seccomp_set_mode_filter(char __user *filter)
{
+ const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
struct seccomp_filter *prepared = NULL;
unsigned long irqflags;
long ret = -EINVAL;

-#ifdef CONFIG_SECCOMP_FILTER
- /* Prepare the new filter outside of the seccomp lock. */
- if (seccomp_mode == SECCOMP_MODE_FILTER) {
- prepared = seccomp_prepare_user_filter(filter);
- if (IS_ERR(prepared))
- return PTR_ERR(prepared);
- }
-#endif
+ /* Prepare the new filter outside of any locking. */
+ prepared = seccomp_prepare_user_filter(filter);
+ if (IS_ERR(prepared))
+ return PTR_ERR(prepared);

spin_lock_irqsave(&current->sighand->siglock, irqflags);

- if (current->seccomp.mode &&
- current->seccomp.mode != seccomp_mode)
+ if (!seccomp_check_mode(current, seccomp_mode))
goto out;

- switch (seccomp_mode) {
- case SECCOMP_MODE_STRICT:
- ret = 0;
-#ifdef TIF_NOTSC
- disable_TSC();
-#endif
- break;
-#ifdef CONFIG_SECCOMP_FILTER
- case SECCOMP_MODE_FILTER:
- ret = seccomp_attach_filter(prepared);
- if (ret)
- goto out;
- /* Do not free the successfully attached filter. */
- prepared = NULL;
- break;
-#endif
- default:
+ ret = seccomp_attach_filter(prepared);
+ if (ret)
goto out;
- }
+ /* Do not free the successfully attached filter. */
+ prepared = NULL;

- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ seccomp_assign_mode(current, seccomp_mode);
out:
spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
seccomp_filter_free(prepared);
return ret;
}
+#else
+static inline long seccomp_set_mode_filter(char __user *filter)
+{
+ return -EINVAL;
+}
+#endif

/**
* prctl_set_seccomp: configures current->seccomp.mode
@@ -572,5 +610,12 @@ out:
*/
long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
{
- return seccomp_set_mode(seccomp_mode, filter);
+ switch (seccomp_mode) {
+ case SECCOMP_MODE_STRICT:
+ return seccomp_set_mode_strict();
+ case SECCOMP_MODE_FILTER:
+ return seccomp_set_mode_filter(filter);
+ default:
+ return -EINVAL;
+ }
}
--
1.7.9.5

2014-06-24 20:49:09

by Kees Cook

Subject: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads that may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start-up, it is
difficult to further restrict the kernel ABI available to those threads
after that point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter), an attempt will be made to synchronize all threads in current's
thread group to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor of the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_DISABLED) are also treated as ancestors, allowing threads to
be transitioned into SECCOMP_MODE_FILTER. If prctl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads as well. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned, and no filters
will have been applied.

The race conditions are against another thread requesting TSYNC, another
thread performing a clone, and another thread changing its filter. The
sighand lock is sufficient for these cases. The clone case is assisted
by the fact that new threads will have their seccomp state duplicated
from their parent before appearing on the tasklist.
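
The ancestry rule above can be sketched in plain C over a simplified
filter list. The structure is a stand-in, not the kernel's, but the
check mirrors the is_ancestor() helper this patch adds: filters form a
singly linked chain growing toward the caller, and NULL acts as the
universal root that every chain implicitly hangs off.

```c
#include <assert.h>
#include <stddef.h>

struct filter {
	struct filter *prev;	/* link toward older (ancestor) filters */
};

/* Returns 1 iff candidate appears on child's ->prev chain, or is
 * NULL (the root ancestor shared by all chains). */
static int is_ancestor(const struct filter *candidate,
		       const struct filter *child)
{
	if (candidate == NULL)
		return 1;
	for (; child; child = child->prev)
		if (child == candidate)
			return 1;
	return 0;
}
```

A thread on a divergent chain (one that does not pass through the
caller's filters) fails the check, which is what makes TSYNC refuse and
report that thread's pid.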

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
---
include/linux/seccomp.h | 2 +
include/uapi/linux/seccomp.h | 3 ++
kernel/seccomp.c | 115 +++++++++++++++++++++++++++++++++++++++++-
3 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9ff98b4bfe2e..15de2a711518 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,6 +3,8 @@

#include <uapi/linux/seccomp.h>

+#define SECCOMP_FILTER_FLAG_MASK ~(SECCOMP_FILTER_FLAG_TSYNC)
+
#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index b258878ba754..0f238a43ff1e 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -14,6 +14,9 @@
#define SECCOMP_SET_MODE_STRICT 0
#define SECCOMP_SET_MODE_FILTER 1

+/* Valid flags for SECCOMP_SET_MODE_FILTER */
+#define SECCOMP_FILTER_FLAG_TSYNC 1
+
/*
* All BPF programs must return a 32-bit value.
* The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 051c3da4fb4c..c58b49375125 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -26,6 +26,7 @@
#ifdef CONFIG_SECCOMP_FILTER
#include <asm/syscall.h>
#include <linux/filter.h>
+#include <linux/pid.h>
#include <linux/ptrace.h>
#include <linux/security.h>
#include <linux/tracehook.h>
@@ -218,6 +219,105 @@ static inline void seccomp_assign_mode(struct task_struct *task,
}

#ifdef CONFIG_SECCOMP_FILTER
+/* Returns 1 if the candidate is an ancestor. */
+static int is_ancestor(struct seccomp_filter *candidate,
+ struct seccomp_filter *child)
+{
+ /* NULL is the root ancestor. */
+ if (candidate == NULL)
+ return 1;
+ for (; child; child = child->prev)
+ if (child == candidate)
+ return 1;
+ return 0;
+}
+
+/**
+ * seccomp_can_sync_threads: checks if all threads can be synchronized
+ *
+ * Expects current->sighand->siglock to be held.
+ *
+ * Returns 0 on success, -ve on error, or the pid of a thread which was
+ * either not in the correct seccomp mode or it did not have an ancestral
+ * seccomp filter.
+ */
+static pid_t seccomp_can_sync_threads(void)
+{
+ struct task_struct *thread, *caller;
+
+ BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+ if (current->seccomp.mode != SECCOMP_MODE_FILTER)
+ return -EACCES;
+
+ /* Validate all threads being eligible for synchronization. */
+ caller = current;
+ for_each_thread(caller, thread) {
+ pid_t failed;
+
+ if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
+ (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
+ is_ancestor(thread->seccomp.filter,
+ caller->seccomp.filter)))
+ continue;
+
+ /* Return the first thread that cannot be synchronized. */
+ failed = task_pid_vnr(thread);
+ /* If the pid cannot be resolved, then return -ESRCH */
+ if (unlikely(WARN_ON(failed == 0)))
+ failed = -ESRCH;
+ return failed;
+ }
+
+ return 0;
+}
+
+/**
+ * seccomp_sync_threads: sets all threads to use current's filter
+ *
+ * Expects current->sighand->siglock to be held,
+ * and for seccomp_can_sync_threads() to have returned success already
+ * without dropping the locks.
+ *
+ */
+static void seccomp_sync_threads(void)
+{
+ struct task_struct *thread, *caller;
+
+ BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+ /* Synchronize all threads. */
+ caller = current;
+ for_each_thread(caller, thread) {
+ /* Get a task reference for the new leaf node. */
+ get_seccomp_filter(caller);
+ /*
+ * Drop the task reference to the shared ancestor since
+ * current's path will hold a reference. (This also
+ * allows a put before the assignment.)
+ */
+ put_seccomp_filter(thread);
+ thread->seccomp.filter = caller->seccomp.filter;
+ /* Opt the other thread into seccomp if needed.
+ * As threads are considered to be trust-realm
+ * equivalent (see ptrace_may_access), it is safe to
+ * allow one thread to transition the other.
+ */
+ if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
+ /*
+ * Don't let an unprivileged task work around
+ * the no_new_privs restriction by creating
+ * a thread that sets it up, enters seccomp,
+ * then dies.
+ */
+ if (task_no_new_privs(caller))
+ task_set_no_new_privs(thread);
+
+ seccomp_assign_mode(thread, SECCOMP_MODE_FILTER);
+ }
+ }
+}
+
/**
* seccomp_prepare_filter: Prepares a seccomp filter for use.
* @fprog: BPF program to install
@@ -349,7 +449,7 @@ static long seccomp_attach_filter(unsigned int flags,
BUG_ON(!spin_is_locked(&current->sighand->siglock));

/* Validate flags. */
- if (flags != 0)
+ if (flags & SECCOMP_FILTER_FLAG_MASK)
return -EINVAL;

/* Validate resulting filter length. */
@@ -359,6 +459,15 @@ static long seccomp_attach_filter(unsigned int flags,
if (total_insns > MAX_INSNS_PER_PATH)
return -ENOMEM;

+ /* If thread sync has been requested, check that it is possible. */
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+ int ret;
+
+ ret = seccomp_can_sync_threads();
+ if (ret)
+ return ret;
+ }
+
/*
* If there is an existing filter, make it the prev and don't drop its
* task reference.
@@ -366,6 +475,10 @@ static long seccomp_attach_filter(unsigned int flags,
filter->prev = current->seccomp.filter;
smp_store_release(&current->seccomp.filter, filter);

+ /* Now that the new filter is in place, synchronize to all threads. */
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+ seccomp_sync_threads();
+
return 0;
}

--
1.7.9.5

2014-06-24 20:49:06

by Kees Cook

Subject: [PATCH v8 8/9] MIPS: add seccomp syscall

Wires up the new seccomp syscall.

Signed-off-by: Kees Cook <[email protected]>
---
arch/mips/include/uapi/asm/unistd.h | 15 +++++++++------
arch/mips/kernel/scall32-o32.S | 1 +
arch/mips/kernel/scall64-64.S | 1 +
arch/mips/kernel/scall64-n32.S | 1 +
arch/mips/kernel/scall64-o32.S | 1 +
5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/mips/include/uapi/asm/unistd.h b/arch/mips/include/uapi/asm/unistd.h
index 5805414777e0..9bc13eaf9d67 100644
--- a/arch/mips/include/uapi/asm/unistd.h
+++ b/arch/mips/include/uapi/asm/unistd.h
@@ -372,16 +372,17 @@
#define __NR_sched_setattr (__NR_Linux + 349)
#define __NR_sched_getattr (__NR_Linux + 350)
#define __NR_renameat2 (__NR_Linux + 351)
+#define __NR_seccomp (__NR_Linux + 352)

/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 351
+#define __NR_Linux_syscalls 352

#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */

#define __NR_O32_Linux 4000
-#define __NR_O32_Linux_syscalls 351
+#define __NR_O32_Linux_syscalls 352

#if _MIPS_SIM == _MIPS_SIM_ABI64

@@ -701,16 +702,17 @@
#define __NR_sched_setattr (__NR_Linux + 309)
#define __NR_sched_getattr (__NR_Linux + 310)
#define __NR_renameat2 (__NR_Linux + 311)
+#define __NR_seccomp (__NR_Linux + 312)

/*
* Offset of the last Linux 64-bit flavoured syscall
*/
-#define __NR_Linux_syscalls 311
+#define __NR_Linux_syscalls 312

#endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */

#define __NR_64_Linux 5000
-#define __NR_64_Linux_syscalls 311
+#define __NR_64_Linux_syscalls 312

#if _MIPS_SIM == _MIPS_SIM_NABI32

@@ -1034,15 +1036,16 @@
#define __NR_sched_setattr (__NR_Linux + 313)
#define __NR_sched_getattr (__NR_Linux + 314)
#define __NR_renameat2 (__NR_Linux + 315)
+#define __NR_seccomp (__NR_Linux + 316)

/*
* Offset of the last N32 flavoured syscall
*/
-#define __NR_Linux_syscalls 315
+#define __NR_Linux_syscalls 316

#endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */

#define __NR_N32_Linux 6000
-#define __NR_N32_Linux_syscalls 315
+#define __NR_N32_Linux_syscalls 316

#endif /* _UAPI_ASM_UNISTD_H */
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 3245474f19d5..ab02d14f1b5c 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -578,3 +578,4 @@ EXPORT(sys_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr /* 4350 */
PTR sys_renameat2
+ PTR sys_seccomp
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index be2fedd4ae33..010dccf128ec 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -431,4 +431,5 @@ EXPORT(sys_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr /* 5310 */
PTR sys_renameat2
+ PTR sys_seccomp
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index c1dbcda4b816..c3b3b6525df5 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -424,4 +424,5 @@ EXPORT(sysn32_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr
PTR sys_renameat2 /* 6315 */
+ PTR sys_seccomp
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index f1343ccd7ed7..bb1550b1f501 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -557,4 +557,5 @@ EXPORT(sys32_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr /* 4350 */
PTR sys_renameat2
+ PTR sys_seccomp
.size sys32_call_table,.-sys32_call_table
--
1.7.9.5

2014-06-24 20:50:22

by Kees Cook

Subject: [PATCH v8 6/9] seccomp: add "seccomp" syscall

This adds the new "seccomp" syscall with both an "operation" and "flags"
parameter for future expansion. The third argument is a pointer value,
used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

In addition to the TSYNC flag in the following patch, there is a non-zero
chance that this syscall could be used for configuring a fixed argument
area for seccomp-tracer-aware processes to pass syscall arguments in
the future. Hence the use of "seccomp" rather than simply
"seccomp_add_filter" for this syscall. Additionally, this syscall takes
the operation, flags, and user pointer as separate arguments because
passing them all through a single user pointer would mean seccomp itself
would be unable to trivially filter the seccomp syscall.
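
As a sketch of that last point: because the operation and flags arrive
as ordinary syscall arguments, a BPF filter can match the syscall number
and deny further seccomp(2) calls. This is a hypothetical userspace
example (lock_out_seccomp is an illustrative name, the fallback syscall
number assumes x86-64, and the architecture check is omitted for
brevity).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif
#ifndef SECCOMP_SET_MODE_FILTER
#define SECCOMP_SET_MODE_FILTER 1
#endif
#ifndef __NR_seccomp
#define __NR_seccomp 317	/* x86-64 */
#endif

/* Install a filter that fails any further seccomp(2) call with EPERM
 * while allowing every other syscall. */
int lock_out_seccomp(void)
{
	struct sock_filter insns[] = {
		/* A = seccomp_data.nr */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* if (nr == __NR_seccomp) return ERRNO(EPERM) */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_seccomp, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}
```

After this returns 0, a second seccomp(2) call from the same process
fails with EPERM, which would not be expressible if the operation were
hidden behind a user pointer.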

Signed-off-by: Kees Cook <[email protected]>
Cc: [email protected]
---
arch/Kconfig | 1 +
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 ++
include/uapi/asm-generic/unistd.h | 4 ++-
include/uapi/linux/seccomp.h | 4 +++
kernel/seccomp.c | 63 ++++++++++++++++++++++++++++++++-----
kernel/sys_ni.c | 3 ++
8 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 97ff872c7acc..0eae9df35b88 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -321,6 +321,7 @@ config HAVE_ARCH_SECCOMP_FILTER
- secure_computing is called from a ptrace_event()-safe context
- secure_computing return value is checked and a return value of -1
results in the system call being skipped immediately.
+ - seccomp syscall wired up

config SECCOMP_FILTER
def_bool y
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b867921612..7527eac24122 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
351 i386 sched_setattr sys_sched_setattr
352 i386 sched_getattr sys_sched_getattr
353 i386 renameat2 sys_renameat2
+354 i386 seccomp sys_seccomp
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..16272a6c12b7 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common seccomp sys_seccomp

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0ed322..1713977ee26f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
+ const char __user *uargs);
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 333640608087..65acbf0e2867 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -699,9 +699,11 @@ __SYSCALL(__NR_sched_setattr, sys_sched_setattr)
__SYSCALL(__NR_sched_getattr, sys_sched_getattr)
#define __NR_renameat2 276
__SYSCALL(__NR_renameat2, sys_renameat2)
+#define __NR_seccomp 277
+__SYSCALL(__NR_seccomp, sys_seccomp)

#undef __NR_syscalls
-#define __NR_syscalls 277
+#define __NR_syscalls 278

/*
* All syscalls below here should go away really,
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index ac2dc9f72973..b258878ba754 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -10,6 +10,10 @@
#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */

+/* Valid operations for seccomp syscall. */
+#define SECCOMP_SET_MODE_STRICT 0
+#define SECCOMP_SET_MODE_FILTER 1
+
/*
* All BPF programs must return a 32-bit value.
* The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index c81000dd0a53..051c3da4fb4c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -19,6 +19,7 @@
#include <linux/sched.h>
#include <linux/seccomp.h>
#include <linux/slab.h>
+#include <linux/syscalls.h>

/* #define SECCOMP_DEBUG 1 */

@@ -308,8 +309,8 @@ free_prog:
*
* Returns filter on success and ERR_PTR otherwise.
*/
-static
-struct seccomp_filter *seccomp_prepare_user_filter(char __user *user_filter)
+static struct seccomp_filter *
+seccomp_prepare_user_filter(const char __user *user_filter)
{
struct sock_fprog fprog;
struct seccomp_filter *filter = ERR_PTR(-EFAULT);
@@ -332,19 +333,25 @@ out:

/**
* seccomp_attach_filter: validate and attach filter
+ * @flags: flags to change filter behavior
* @filter: seccomp filter to add to the current process
*
* Caller must be holding current->sighand->siglock lock.
*
* Returns 0 on success, -ve on error.
*/
-static long seccomp_attach_filter(struct seccomp_filter *filter)
+static long seccomp_attach_filter(unsigned int flags,
+ struct seccomp_filter *filter)
{
unsigned long total_insns;
struct seccomp_filter *walker;

BUG_ON(!spin_is_locked(&current->sighand->siglock));

+ /* Validate flags. */
+ if (flags != 0)
+ return -EINVAL;
+
/* Validate resulting filter length. */
total_insns = filter->prog->len;
for (walker = current->seccomp.filter; walker; walker = walker->prev)
@@ -555,6 +562,7 @@ out:
#ifdef CONFIG_SECCOMP_FILTER
/**
* seccomp_set_mode_filter: internal function for setting seccomp filter
+ * @flags: flags to change filter behavior
* @filter: struct sock_fprog containing filter
*
* This function may be called repeatedly to install additional filters.
@@ -565,7 +573,8 @@ out:
*
* Returns 0 on success or -EINVAL on failure.
*/
-static long seccomp_set_mode_filter(char __user *filter)
+static long seccomp_set_mode_filter(unsigned int flags,
+ const char __user *filter)
{
const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
struct seccomp_filter *prepared = NULL;
@@ -582,7 +591,7 @@ static long seccomp_set_mode_filter(char __user *filter)
if (!seccomp_check_mode(current, seccomp_mode))
goto out;

- ret = seccomp_attach_filter(prepared);
+ ret = seccomp_attach_filter(flags, prepared);
if (ret)
goto out;
/* Do not free the successfully attached filter. */
@@ -595,12 +604,35 @@ out:
return ret;
}
#else
-static inline long seccomp_set_mode_filter(char __user *filter)
+static inline long seccomp_set_mode_filter(unsigned int flags,
+ const char __user *filter)
{
return -EINVAL;
}
#endif

+/* Common entry point for both prctl and syscall. */
+static long do_seccomp(unsigned int op, unsigned int flags,
+ const char __user *uargs)
+{
+ switch (op) {
+ case SECCOMP_SET_MODE_STRICT:
+ if (flags != 0 || uargs != NULL)
+ return -EINVAL;
+ return seccomp_set_mode_strict();
+ case SECCOMP_SET_MODE_FILTER:
+ return seccomp_set_mode_filter(flags, uargs);
+ default:
+ return -EINVAL;
+ }
+}
+
+SYSCALL_DEFINE3(seccomp, unsigned int, op, unsigned int, flags,
+ const char __user *, uargs)
+{
+ return do_seccomp(op, flags, uargs);
+}
+
/**
* prctl_set_seccomp: configures current->seccomp.mode
* @seccomp_mode: requested mode to use
@@ -610,12 +642,27 @@ static inline long seccomp_set_mode_filter(char __user *filter)
*/
long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
{
+ unsigned int op;
+ char __user *uargs;
+
switch (seccomp_mode) {
case SECCOMP_MODE_STRICT:
- return seccomp_set_mode_strict();
+ op = SECCOMP_SET_MODE_STRICT;
+ /*
+ * Setting strict mode through prctl always ignored the filter,
+ * so make sure it is always NULL here to pass the internal
+ * check in do_seccomp().
+ */
+ uargs = NULL;
+ break;
case SECCOMP_MODE_FILTER:
- return seccomp_set_mode_filter(filter);
+ op = SECCOMP_SET_MODE_FILTER;
+ uargs = filter;
+ break;
default:
return -EINVAL;
}
+
+ /* prctl interface doesn't have flags, so they are always zero. */
+ return do_seccomp(op, 0, uargs);
}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b51b5df..2904a2105914 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -213,3 +213,6 @@ cond_syscall(compat_sys_open_by_handle_at);

/* compare kernel pointers */
cond_syscall(sys_kcmp);
+
+/* operate on Secure Computing state */
+cond_syscall(sys_seccomp);
--
1.7.9.5

2014-06-24 20:50:47

by Kees Cook

Subject: [PATCH v8 2/9] seccomp: split filter prep from check and apply

In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.

Signed-off-by: Kees Cook <[email protected]>
---
kernel/seccomp.c | 97 +++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 67 insertions(+), 30 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index afb916c7e890..edc8c79ed16d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -18,6 +18,7 @@
#include <linux/compat.h>
#include <linux/sched.h>
#include <linux/seccomp.h>
+#include <linux/slab.h>

/* #define SECCOMP_DEBUG 1 */

@@ -26,7 +27,6 @@
#include <linux/filter.h>
#include <linux/ptrace.h>
#include <linux/security.h>
-#include <linux/slab.h>
#include <linux/tracehook.h>
#include <linux/uaccess.h>

@@ -196,27 +196,21 @@ static u32 seccomp_run_filters(int syscall)
}

/**
- * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * seccomp_prepare_filter: Prepares a seccomp filter for use.
* @fprog: BPF program to install
*
- * Returns 0 on success or an errno on failure.
+ * Returns filter on success or an ERR_PTR on failure.
*/
-static long seccomp_attach_filter(struct sock_fprog *fprog)
+static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
{
struct seccomp_filter *filter;
unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
- unsigned long total_insns = fprog->len;
struct sock_filter *fp;
int new_len;
long ret;

if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
- return -EINVAL;
-
- for (filter = current->seccomp.filter; filter; filter = filter->prev)
- total_insns += filter->prog->len + 4; /* include a 4 instr penalty */
- if (total_insns > MAX_INSNS_PER_PATH)
- return -ENOMEM;
+ return ERR_PTR(-EINVAL);

/*
* Installing a seccomp filter requires that the task has
@@ -227,11 +221,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
if (!current->no_new_privs &&
security_capable_noaudit(current_cred(), current_user_ns(),
CAP_SYS_ADMIN) != 0)
- return -EACCES;
+ return ERR_PTR(-EACCES);

fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
if (!fp)
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);

/* Copy the instructions from fprog. */
ret = -EFAULT;
@@ -275,13 +269,7 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)

sk_filter_select_runtime(filter->prog);

- /*
- * If there is an existing filter, make it the prev and don't drop its
- * task reference.
- */
- filter->prev = current->seccomp.filter;
- current->seccomp.filter = filter;
- return 0;
+ return filter;

free_filter_prog:
kfree(filter->prog);
@@ -289,19 +277,20 @@ free_filter:
kfree(filter);
free_prog:
kfree(fp);
- return ret;
+ return ERR_PTR(ret);
}

/**
- * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
+ * seccomp_prepare_user_filter - prepares a user-supplied sock_fprog
* @user_filter: pointer to the user data containing a sock_fprog.
*
- * Returns 0 on success and non-zero otherwise.
+ * Returns filter on success and ERR_PTR otherwise.
*/
-static long seccomp_attach_user_filter(char __user *user_filter)
+static
+struct seccomp_filter *seccomp_prepare_user_filter(char __user *user_filter)
{
struct sock_fprog fprog;
- long ret = -EFAULT;
+ struct seccomp_filter *filter = ERR_PTR(-EFAULT);

#ifdef CONFIG_COMPAT
if (is_compat_task()) {
@@ -314,9 +303,37 @@ static long seccomp_attach_user_filter(char __user *user_filter)
#endif
if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
goto out;
- ret = seccomp_attach_filter(&fprog);
+ filter = seccomp_prepare_filter(&fprog);
out:
- return ret;
+ return filter;
+}
+
+/**
+ * seccomp_attach_filter: validate and attach filter
+ * @filter: seccomp filter to add to the current process
+ *
+ * Returns 0 on success, -ve on error.
+ */
+static long seccomp_attach_filter(struct seccomp_filter *filter)
+{
+ unsigned long total_insns;
+ struct seccomp_filter *walker;
+
+ /* Validate resulting filter length. */
+ total_insns = filter->prog->len;
+ for (walker = current->seccomp.filter; walker; walker = walker->prev)
+ total_insns += walker->prog->len + 4; /* 4 instr penalty */
+ if (total_insns > MAX_INSNS_PER_PATH)
+ return -ENOMEM;
+
+ /*
+ * If there is an existing filter, make it the prev and don't drop its
+ * task reference.
+ */
+ filter->prev = current->seccomp.filter;
+ current->seccomp.filter = filter;
+
+ return 0;
}

/* get_seccomp_filter - increments the reference count of the filter on @tsk */
@@ -329,6 +346,14 @@ void get_seccomp_filter(struct task_struct *tsk)
atomic_inc(&orig->usage);
}

+static inline void seccomp_filter_free(struct seccomp_filter *filter)
+{
+ if (filter) {
+ sk_filter_free(filter->prog);
+ kfree(filter);
+ }
+}
+
/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
void put_seccomp_filter(struct task_struct *tsk)
{
@@ -337,8 +362,7 @@ void put_seccomp_filter(struct task_struct *tsk)
while (orig && atomic_dec_and_test(&orig->usage)) {
struct seccomp_filter *freeme = orig;
orig = orig->prev;
- sk_filter_free(freeme->prog);
- kfree(freeme);
+ seccomp_filter_free(freeme);
}
}

@@ -488,8 +512,18 @@ long prctl_get_seccomp(void)
*/
static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
{
+ struct seccomp_filter *prepared = NULL;
long ret = -EINVAL;

+#ifdef CONFIG_SECCOMP_FILTER
+ /* Prepare the new filter outside of the seccomp lock. */
+ if (seccomp_mode == SECCOMP_MODE_FILTER) {
+ prepared = seccomp_prepare_user_filter(filter);
+ if (IS_ERR(prepared))
+ return PTR_ERR(prepared);
+ }
+#endif
+
if (current->seccomp.mode &&
current->seccomp.mode != seccomp_mode)
goto out;
@@ -503,9 +537,11 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
break;
#ifdef CONFIG_SECCOMP_FILTER
case SECCOMP_MODE_FILTER:
- ret = seccomp_attach_user_filter(filter);
+ ret = seccomp_attach_filter(prepared);
if (ret)
goto out;
+ /* Do not free the successfully attached filter. */
+ prepared = NULL;
break;
#endif
default:
@@ -515,6 +551,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
current->seccomp.mode = seccomp_mode;
set_thread_flag(TIF_SECCOMP);
out:
+ seccomp_filter_free(prepared);
return ret;
}

--
1.7.9.5

2014-06-24 20:51:02

by Kees Cook

Subject: [PATCH v8 4/9] sched: move no_new_privs into new atomic flags

Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.

Signed-off-by: Kees Cook <[email protected]>
---
fs/exec.c | 4 ++--
include/linux/sched.h | 16 ++++++++++++++--
kernel/seccomp.c | 2 +-
kernel/sys.c | 4 ++--
security/apparmor/domain.c | 4 ++--
5 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a3d33fe592d6..0f5c272410f6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1234,7 +1234,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
* This isn't strictly necessary, but it makes it harder for LSMs to
* mess up.
*/
- if (current->no_new_privs)
+ if (task_no_new_privs(current))
bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;

t = p;
@@ -1272,7 +1272,7 @@ int prepare_binprm(struct linux_binprm *bprm)
bprm->cred->egid = current_egid();

if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) &&
- !current->no_new_privs &&
+ !task_no_new_privs(current) &&
kuid_has_mapping(bprm->cred->user_ns, inode->i_uid) &&
kgid_has_mapping(bprm->cred->user_ns, inode->i_gid)) {
/* Set-uid? */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0c987a..0c6917fbd8d4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1307,8 +1307,7 @@ struct task_struct {
* execve */
unsigned in_iowait:1;

- /* task may not gain privileges */
- unsigned no_new_privs:1;
+ unsigned long atomic_flags; /* Flags needing atomic access. */

/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;
@@ -1967,6 +1966,19 @@ static inline void memalloc_noio_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}

+/* Per-process atomic flags. */
+#define PFA_NO_NEW_PRIVS 0x00000001 /* May not gain new privileges. */
+
+static inline bool task_no_new_privs(struct task_struct *p)
+{
+ return test_bit(PFA_NO_NEW_PRIVS, &p->atomic_flags);
+}
+
+static inline void task_set_no_new_privs(struct task_struct *p)
+{
+ set_bit(PFA_NO_NEW_PRIVS, &p->atomic_flags);
+}
+
/*
* task->jobctl flags
*/
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 405eb72dfe35..eb8248ab045e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -218,7 +218,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
* This avoids scenarios where unprivileged tasks can affect the
* behavior of privileged children.
*/
- if (!current->no_new_privs &&
+ if (!task_no_new_privs(current) &&
security_capable_noaudit(current_cred(), current_user_ns(),
CAP_SYS_ADMIN) != 0)
return ERR_PTR(-EACCES);
diff --git a/kernel/sys.c b/kernel/sys.c
index 66a751ebf9d9..ce8129192a26 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1990,12 +1990,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (arg2 != 1 || arg3 || arg4 || arg5)
return -EINVAL;

- current->no_new_privs = 1;
+ task_set_no_new_privs(current);
break;
case PR_GET_NO_NEW_PRIVS:
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
- return current->no_new_privs ? 1 : 0;
+ return task_no_new_privs(current) ? 1 : 0;
case PR_GET_THP_DISABLE:
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 452567d3a08e..d97cba3e3849 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -621,7 +621,7 @@ int aa_change_hat(const char *hats[], int count, u64 token, bool permtest)
* There is no exception for unconfined as change_hat is not
* available.
*/
- if (current->no_new_privs)
+ if (task_no_new_privs(current))
return -EPERM;

/* released below */
@@ -776,7 +776,7 @@ int aa_change_profile(const char *ns_name, const char *hname, bool onexec,
* no_new_privs is set because this aways results in a reduction
* of permissions.
*/
- if (current->no_new_privs && !unconfined(profile)) {
+ if (task_no_new_privs(current) && !unconfined(profile)) {
put_cred(cred);
return -EPERM;
}
--
1.7.9.5

2014-06-24 20:51:36

by Kees Cook

Subject: [PATCH v8 1/9] seccomp: create internal mode-setting function

In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.

Signed-off-by: Kees Cook <[email protected]>
---
kernel/seccomp.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 301bbc24739c..afb916c7e890 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -473,7 +473,7 @@ long prctl_get_seccomp(void)
}

/**
- * prctl_set_seccomp: configures current->seccomp.mode
+ * seccomp_set_mode: internal function for setting seccomp mode
* @seccomp_mode: requested mode to use
* @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
*
@@ -486,7 +486,7 @@ long prctl_get_seccomp(void)
*
* Returns 0 on success or -EINVAL on failure.
*/
-long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
+static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
{
long ret = -EINVAL;

@@ -517,3 +517,15 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
out:
return ret;
}
+
+/**
+ * prctl_set_seccomp: configures current->seccomp.mode
+ * @seccomp_mode: requested mode to use
+ * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
+{
+ return seccomp_set_mode(seccomp_mode, filter);
+}
--
1.7.9.5

2014-06-24 20:57:02

by Kees Cook

Subject: [PATCH v8 1/1] man-pages: seccomp.2: document syscall

Combines documentation from prctl, in-kernel seccomp_filter.txt and
dropper.c, along with details specific to the new syscall.

Signed-off-by: Kees Cook <[email protected]>
---
v3:
- change args to void * (luto)
- small typo cleanups
v2:
- add full example code, based on "dropper.c" in samples/seccomp/
---
man2/seccomp.2 | 400 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 400 insertions(+)
create mode 100644 man2/seccomp.2

diff --git a/man2/seccomp.2 b/man2/seccomp.2
new file mode 100644
index 0000000..f64950f
--- /dev/null
+++ b/man2/seccomp.2
@@ -0,0 +1,400 @@
+.\" Copyright (C) 2014 Kees Cook <[email protected]>
+.\" and Copyright (C) 2012 Will Drewry <[email protected]>
+.\" and Copyright (C) 2008 Michael Kerrisk <[email protected]>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
+.SH NAME
+seccomp \-
+operate on Secure Computing state of the process
+.SH SYNOPSIS
+.nf
+.B #include <linux/seccomp.h>
+.B #include <linux/filter.h>
+.B #include <linux/audit.h>
+.B #include <linux/signal.h>
+.B #include <sys/ptrace.h>
+
+.BI "int seccomp(unsigned int " operation ", unsigned int " flags ,
+.BI " void *" args );
+.fi
+.SH DESCRIPTION
+The
+.BR seccomp ()
+system call operates on the Secure Computing (seccomp) state of the
+current process.
+
+Currently, Linux supports the following
+.IR operation
+values:
+.TP
+.BR SECCOMP_SET_MODE_STRICT
+The only system calls that the calling thread is permitted to make are
+.BR read (2),
+.BR write (2),
+.BR _exit (2),
+and
+.BR sigreturn (2).
+Other system calls result in the delivery of a
+.BR SIGKILL
+signal. Strict secure computing mode is useful for number-crunching
+applications that may need to execute untrusted byte code, perhaps
+obtained by reading from a pipe or socket.
+
+This operation is available only if the kernel is configured with
+.BR CONFIG_SECCOMP
+enabled.
+
+The value of
+.IR flags
+must be 0, and
+.IR args
+must be NULL.
+
+This operation is functionally identical to calling
+.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_STRICT)" .
+.TP
+.BR SECCOMP_SET_MODE_FILTER
+The system calls allowed are defined by a pointer to a Berkeley Packet
+Filter (BPF) passed via
+.IR args .
+This argument is a pointer to
+.IR "struct\ sock_fprog" ;
+it can be designed to filter arbitrary system calls and system call
+arguments. If the filter is invalid, the call will fail, returning
+.BR EINVAL
+in
+.IR errno .
+
+If
+.BR fork (2),
+.BR clone (2),
+or
+.BR execve (2)
+are allowed by the filter, any child processes will be constrained to
+the same filters and system calls as the parent.
+
+Prior to using this operation, the process must call
+.IR "prctl(PR_SET_NO_NEW_PRIVS,\ 1)"
+or run with
+.BR CAP_SYS_ADMIN
+privileges in its namespace. If neither is true, the call will fail
+and return
+.BR EACCES
+in
+.IR errno .
+This requirement ensures that filter programs cannot be applied to child
+processes with greater privileges than the process that installed them.
+
+Additionally, if
+.BR prctl (2)
+or
+.BR seccomp (2)
+is allowed by the attached filter, additional filters may be layered on
+which will increase evaluation time, but allow for further reduction of
+the attack surface during execution of a process.
+
+This operation is available only if the kernel is configured with
+.BR CONFIG_SECCOMP_FILTER
+enabled.
+
+When
+.IR flags
+are 0, this operation is functionally identical to calling
+.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_FILTER,\ args)" .
+
+The recognized
+.IR flags
+are:
+.RS
+.TP
+.BR SECCOMP_FILTER_FLAG_TSYNC
+When adding a new filter, synchronize all other threads of the current
+process to the same seccomp filter tree. If any thread cannot do this,
+the call will not attach the new seccomp filter, and will fail returning
+the first thread ID found that cannot synchronize. Synchronization will
+fail if another thread is in
+.BR SECCOMP_MODE_STRICT
+or if it has attached new seccomp filters to itself, diverging from the
+calling thread's filter tree.
+.RE
+.SH FILTERS
+When adding filters via
+.BR SECCOMP_SET_MODE_FILTER ,
+.IR args
+points to a filter program:
+
+.in +4n
+.nf
+struct sock_fprog {
+ unsigned short len; /* Number of BPF instructions */
+ struct sock_filter *filter;
+};
+.fi
+.in
+
+Each program must contain one or more BPF instructions:
+
+.in +4n
+.nf
+struct sock_filter { /* Filter block */
+ __u16 code; /* Actual filter code */
+ __u8 jt; /* Jump true */
+ __u8 jf; /* Jump false */
+ __u32 k; /* Generic multiuse field */
+};
+.fi
+.in
+
+When executing the instructions, the BPF program executes over the
+syscall information made available via:
+
+.in +4n
+.nf
+struct seccomp_data {
+ int nr; /* system call number */
+ __u32 arch; /* AUDIT_ARCH_* value */
+ __u64 instruction_pointer; /* CPU instruction pointer */
+ __u64 args[6]; /* up to 6 system call arguments */
+};
+.fi
+.in
+
+A seccomp filter may return any of the following values. If multiple
+filters exist, the return value for the evaluation of a given system
+call will always use the value of highest precedence. (For example,
+.BR SECCOMP_RET_KILL
+will always take precedence.)
+
+In precedence order, they are:
+.TP
+.BR SECCOMP_RET_KILL
+Results in the task exiting immediately without executing the
+system call. The exit status of the task (status & 0x7f) will
+be
+.BR SIGSYS ,
+not
+.BR SIGKILL .
+.TP
+.BR SECCOMP_RET_TRAP
+Results in the kernel sending a
+.BR SIGSYS
+signal to the triggering task without executing the system call.
+.IR siginfo\->si_call_addr
+will show the address of the system call instruction, and
+.IR siginfo\->si_syscall
+and
+.IR siginfo\->si_arch
+will indicate which syscall was attempted. The program counter will be
+as though the syscall happened (i.e. it will not point to the syscall
+instruction). The return value register will contain an arch\-dependent
+value; if resuming execution, set it to something sensible.
+(The architecture dependency is because replacing it with
+.BR ENOSYS
+could overwrite some useful information.)
+
+The
+.BR SECCOMP_RET_DATA
+portion of the return value will be passed as
+.IR si_errno .
+
+.BR SIGSYS
+triggered by seccomp will have a
+.IR si_code
+of
+.BR SYS_SECCOMP .
+.TP
+.BR SECCOMP_RET_ERRNO
+Results in the lower 16-bits of the return value being passed
+to userland as the
+.IR errno
+without executing the system call.
+.TP
+.BR SECCOMP_RET_TRACE
+When returned, this value will cause the kernel to attempt to
+notify a ptrace()-based tracer prior to executing the system
+call. If there is no tracer present,
+.BR ENOSYS
+is returned to userland and the system call is not executed.
+
+A tracer will be notified if it requests
+.BR PTRACE_O_TRACESECCOMP
+using
+.IR ptrace(PTRACE_SETOPTIONS) .
+The tracer will be notified of a
+.BR PTRACE_EVENT_SECCOMP
+and the
+.BR SECCOMP_RET_DATA
+portion of the BPF program return value will be available to the tracer
+via
+.BR PTRACE_GETEVENTMSG .
+
+The tracer can skip the system call by changing the syscall number
+to \-1. Alternatively, the tracer can change the system call
+requested by changing the system call to a valid syscall number. If
+the tracer asks to skip the system call, then the system call will
+appear to return the value that the tracer puts in the return value
+register.
+
+The seccomp check will not be run again after the tracer is
+notified. (This means that seccomp-based sandboxes MUST NOT
+allow use of ptrace, even of other sandboxed processes, without
+extreme care; ptracers can use this mechanism to escape.)
+.TP
+.BR SECCOMP_RET_ALLOW
+Results in the system call being executed.
+
+If multiple filters exist, the return value for the evaluation of a
+given system call will always use the value of highest precedence.
+
+Precedence is only determined using the
+.BR SECCOMP_RET_ACTION
+mask. When multiple filters return values of the same precedence,
+only the
+.BR SECCOMP_RET_DATA
+from the most recently installed filter will be returned.
+.SH RETURN VALUE
+On success,
+.BR seccomp ()
+returns 0.
+On error, if
+.BR SECCOMP_FILTER_FLAG_TSYNC
+was used, the return value is the thread ID that caused the
+synchronization failure. On other errors, \-1 is returned, and
+.IR errno
+is set to indicate the cause of the error.
+.SH ERRORS
+.BR seccomp ()
+can fail for the following reasons:
+.TP
+.BR EACCES
+The caller did not have the
+.BR CAP_SYS_ADMIN
+capability, or had not set
+.IR no_new_privs
+before using
+.BR SECCOMP_SET_MODE_FILTER .
+.TP
+.BR EFAULT
+.IR args
+was not a valid address.
+.TP
+.BR EINVAL
+.IR operation
+is unknown; or
+.IR flags
+are invalid for the given
+.IR operation .
+.TP
+.BR ESRCH
+Another thread caused a failure during thread sync, but its ID could not
+be determined.
+.SH VERSIONS
+This system call first appeared in Linux 3.16.
+.\" FIXME Add glibc version
+.SH CONFORMING TO
+This system call is a nonstandard Linux extension.
+.SH NOTES
+.BR seccomp ()
+provides a superset of the functionality provided by
+.IR PR_SET_SECCOMP
+of
+.BR prctl (2)
+(which does not support
+.IR flags ).
+.SH EXAMPLE
+.nf
+#include <errno.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <linux/audit.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <sys/prctl.h>
+
+static int install_filter(int syscall, int arch, int error)
+{
+ struct sock_filter filter[] = {
+ /* Load architecture. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, arch))),
+ /* Jump forward 4 instructions on architecture mismatch. */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, arch, 0, 4),
+ /* Load syscall number. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, nr))),
+ /* Jump forward 1 instruction on syscall mismatch. */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, syscall, 0, 1),
+ /* Matching arch and syscall: return specific errno. */
+ BPF_STMT(BPF_RET+BPF_K,
+ SECCOMP_RET_ERRNO|(error & SECCOMP_RET_DATA)),
+ /* Destination of syscall mismatch: Allow other syscalls. */
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ /* Destination of arch mismatch: Kill process. */
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog)) {
+ perror("seccomp");
+ return EXIT_FAILURE;
+ }
+ return EXIT_SUCCESS;
+}
+
+int main(int argc, char **argv)
+{
+ if (argc < 5) {
+ fprintf(stderr, "Usage:\\n"
+ "refuse <syscall_nr> <arch> <errno> <prog> [<args>]\\n"
+ "Hint: AUDIT_ARCH_I386: 0x%X\\n"
+ " AUDIT_ARCH_X86_64: 0x%X\\n"
+ "\\n", AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
+ return EXIT_FAILURE;
+ }
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl");
+ return EXIT_FAILURE;
+ }
+ if (install_filter(strtol(argv[1], NULL, 0),
+ strtol(argv[2], NULL, 0),
+ strtol(argv[3], NULL, 0)))
+ return EXIT_FAILURE;
+ execv(argv[4], &argv[4]);
+ perror("execv");
+ return EXIT_FAILURE;
+}
+.fi
+.SH SEE ALSO
+.ad l
+.nh
+.BR prctl (2),
+.BR ptrace (2),
+.BR signal (7),
+.BR socket (7)
+.ad
--
1.7.9.5



--
Kees Cook @outflux.net

2014-06-25 13:06:31

by Alan Cox

Subject: Re: [PATCH v8 1/1] man-pages: seccomp.2: document syscall

On Tue, 24 Jun 2014 13:56:15 -0700
Kees Cook <[email protected]> wrote:

> Combines documentation from prctl, in-kernel seccomp_filter.txt and
> dropper.c, along with details specific to the new syscall.

What is the license on the example? Probably you want to propagate the
minimal form of the text in seccomp/dropper into the document example to
avoid confusion?

Alan

2014-06-25 13:46:26

by Oleg Nesterov

Subject: Re: [PATCH v8 4/9] sched: move no_new_privs into new atomic flags

On 06/24, Kees Cook wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1307,8 +1307,7 @@ struct task_struct {
> * execve */
> unsigned in_iowait:1;
>
> - /* task may not gain privileges */
> - unsigned no_new_privs:1;
> + unsigned long atomic_flags; /* Flags needing atomic access. */
>
> /* Revert to default priority/policy when forking */
> unsigned sched_reset_on_fork:1;

Agreed, personally I like it more than seccomp->flags.

But probably it would be better to place the new member before/after
other bitfields to save the space?

Oleg.

2014-06-25 13:53:08

by Oleg Nesterov

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On 06/24, Kees Cook wrote:
>
> +static inline void seccomp_assign_mode(struct task_struct *task,
> + unsigned long seccomp_mode)
> +{
> + BUG_ON(!spin_is_locked(&task->sighand->siglock));
> +
> + task->seccomp.mode = seccomp_mode;
> + set_tsk_thread_flag(task, TIF_SECCOMP);
> +}

OK, but unless task == current this can race with secure_computing().
I think this needs smp_mb__before_atomic() and secure_computing() needs
rmb() after test_bit(TIF_SECCOMP).

Otherwise, can't __secure_computing() hit BUG() if it sees the old
mode == SECCOMP_MODE_DISABLED ?

Or seccomp_run_filters() can see ->filters == NULL and WARN(),
smp_load_acquire() only serializes that LOAD with the subsequent memory
operations.

Oleg.

2014-06-25 14:05:51

by Oleg Nesterov

Subject: Re: [PATCH v8 3/9] seccomp: introduce writer locking

On 06/24, Kees Cook wrote:
>
> @@ -524,6 +529,8 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
> }
> #endif
>
> + spin_lock_irqsave(&current->sighand->siglock, irqflags);
> +

Well, I won't argue if you prefer to use _irqsave "just in case".

But irqs must be enabled in syscall paths, so you could use spin_lock_irq().
The same for seccomp_set_mode_filter() added later.

Oleg.

2014-06-25 14:23:42

by Oleg Nesterov

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On 06/24, Kees Cook wrote:
>
> +static void seccomp_sync_threads(void)
> +{
> + struct task_struct *thread, *caller;
> +
> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
> +
> + /* Synchronize all threads. */
> + caller = current;
> + for_each_thread(caller, thread) {
> + /* Get a task reference for the new leaf node. */
> + get_seccomp_filter(caller);
> + /*
> + * Drop the task reference to the shared ancestor since
> + * current's path will hold a reference. (This also
> + * allows a put before the assignment.)
> + */
> + put_seccomp_filter(thread);
> + thread->seccomp.filter = caller->seccomp.filter;
> + /* Opt the other thread into seccomp if needed.
> + * As threads are considered to be trust-realm
> + * equivalent (see ptrace_may_access), it is safe to
> + * allow one thread to transition the other.
> + */
> + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
> + /*
> + * Don't let an unprivileged task work around
> + * the no_new_privs restriction by creating
> + * a thread that sets it up, enters seccomp,
> + * then dies.
> + */
> + if (task_no_new_privs(caller))
> + task_set_no_new_privs(thread);
> +
> + seccomp_assign_mode(thread, SECCOMP_MODE_FILTER);
> + }
> + }
> +}

OK, personally I think this all makes sense. I even think that perhaps
SECCOMP_FILTER_FLAG_TSYNC should allow filter == NULL; a thread might
want to "sync" without adding a new filter, but this is minor/offtopic.

But. Doesn't this change add the new security hole?

Obviously, we should not allow to install a filter and then (say) exec
a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS.

But what if "thread->seccomp.filter = caller->seccomp.filter" races with
any user of task_no_new_privs() ? Say, suppose this thread has already
passed check_unsafe_exec/etc and it is going to exec the suid binary?

Oleg.

2014-06-25 14:44:18

by Kees Cook

Subject: Re: [PATCH v8 4/9] sched: move no_new_privs into new atomic flags

On Wed, Jun 25, 2014 at 6:43 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/24, Kees Cook wrote:
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1307,8 +1307,7 @@ struct task_struct {
>> * execve */
>> unsigned in_iowait:1;
>>
>> - /* task may not gain privileges */
>> - unsigned no_new_privs:1;
>> + unsigned long atomic_flags; /* Flags needing atomic access. */
>>
>> /* Revert to default priority/policy when forking */
>> unsigned sched_reset_on_fork:1;
>
> Agreed, personally I like it more than seccomp->flags.
>
> But probably it would be better to place the new member before/after
> other bitfields to save the space?

Sure, I'll move it down. (Though I thought the compiler was smarter about that.)

-Kees

--
Kees Cook
Chrome OS Security

2014-06-25 14:51:28

by Kees Cook

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Wed, Jun 25, 2014 at 6:51 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/24, Kees Cook wrote:
>>
>> +static inline void seccomp_assign_mode(struct task_struct *task,
>> + unsigned long seccomp_mode)
>> +{
>> + BUG_ON(!spin_is_locked(&task->sighand->siglock));
>> +
>> + task->seccomp.mode = seccomp_mode;
>> + set_tsk_thread_flag(task, TIF_SECCOMP);
>> +}
>
> OK, but unless task == current this can race with secure_computing().
> I think this needs smp_mb__before_atomic() and secure_computing() needs
> rmb() after test_bit(TIF_SECCOMP).
>
> Otherwise, can't __secure_computing() hit BUG() if it sees the old
> mode == SECCOMP_MODE_DISABLED ?
>
> Or seccomp_run_filters() can see ->filters == NULL and WARN(),
> smp_load_acquire() only serializes that LOAD with the subsequent memory
> operations.

Hm, actually, now I'm worried about smp_load_acquire() being too slow
in run_filters().

The ordering must be:
- task->seccomp.filter must be valid before
- task->seccomp.mode is set, which must be valid before
- TIF_SECCOMP is set

But I don't want to impact secure_computing(). What's the best way to
make sure this ordering is respected?

-Kees

--
Kees Cook
Chrome OS Security

2014-06-25 15:08:15

by Kees Cook

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Wed, Jun 25, 2014 at 7:21 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/24, Kees Cook wrote:
>>
>> +static void seccomp_sync_threads(void)
>> +{
>> + struct task_struct *thread, *caller;
>> +
>> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
>> +
>> + /* Synchronize all threads. */
>> + caller = current;
>> + for_each_thread(caller, thread) {
>> + /* Get a task reference for the new leaf node. */
>> + get_seccomp_filter(caller);
>> + /*
>> + * Drop the task reference to the shared ancestor since
>> + * current's path will hold a reference. (This also
>> + * allows a put before the assignment.)
>> + */
>> + put_seccomp_filter(thread);
>> + thread->seccomp.filter = caller->seccomp.filter;
>> + /* Opt the other thread into seccomp if needed.
>> + * As threads are considered to be trust-realm
>> + * equivalent (see ptrace_may_access), it is safe to
>> + * allow one thread to transition the other.
>> + */
>> + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
>> + /*
>> + * Don't let an unprivileged task work around
>> + * the no_new_privs restriction by creating
>> + * a thread that sets it up, enters seccomp,
>> + * then dies.
>> + */
>> + if (task_no_new_privs(caller))
>> + task_set_no_new_privs(thread);
>> +
>> + seccomp_assign_mode(thread, SECCOMP_MODE_FILTER);
>> + }
>> + }
>> +}
>
> OK, personally I think this all make sense. I even think that perhaps
> SECCOMP_FILTER_FLAG_TSYNC should allow filter == NULL, a thread might
> want to "sync" without adding the new filter, but this is minor/offtopic.
>
> But. Doesn't this change add a new security hole?
>
> Obviously, we should not allow to install a filter and then (say) exec
> a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS.
>
> But what if "thread->seccomp.filter = caller->seccomp.filter" races with
> any user of task_no_new_privs() ? Say, suppose this thread has already
> passed check_unsafe_exec/etc and it is going to exec the suid binary?

Oh, ew. Yeah. It looks like there's a cred lock to be held to combat this?

I wonder if changes to nnp need to "flushed" during syscall entry
instead of getting updated externally/asynchronously? That way it
won't be out of sync with the seccomp mode/filters.

Perhaps secure computing needs to check some (maybe seccomp-only)
atomic flags and flip on the "real" nnp if found?

-Kees

--
Kees Cook
Chrome OS Security

2014-06-25 15:10:45

by Kees Cook

Subject: Re: [PATCH v8 1/1] man-pages: seccomp.2: document syscall

On Wed, Jun 25, 2014 at 6:04 AM, One Thousand Gnomes
<[email protected]> wrote:
> On Tue, 24 Jun 2014 13:56:15 -0700
> Kees Cook <[email protected]> wrote:
>
>> Combines documentation from prctl, in-kernel seccomp_filter.txt and
>> dropper.c, along with details specific to the new syscall.
>
> What is the license on the example ? Probably you want to propogate the
> minimal form of the text in seccomp/dropper into the document example to
> avoid confusion ?

What is the license of the other code examples in man-pages?

-Kees

--
Kees Cook
Chrome OS Security

2014-06-25 16:10:56

by Andy Lutomirski

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Wed, Jun 25, 2014 at 7:51 AM, Kees Cook <[email protected]> wrote:
> On Wed, Jun 25, 2014 at 6:51 AM, Oleg Nesterov <[email protected]> wrote:
>> On 06/24, Kees Cook wrote:
>>>
>>> +static inline void seccomp_assign_mode(struct task_struct *task,
>>> + unsigned long seccomp_mode)
>>> +{
>>> + BUG_ON(!spin_is_locked(&task->sighand->siglock));
>>> +
>>> + task->seccomp.mode = seccomp_mode;
>>> + set_tsk_thread_flag(task, TIF_SECCOMP);
>>> +}
>>
>> OK, but unless task == current this can race with secure_computing().
>> I think this needs smp_mb__before_atomic() and secure_computing() needs
>> rmb() after test_bit(TIF_SECCOMP).
>>
>> Otherwise, can't __secure_computing() hit BUG() if it sees the old
>> mode == SECCOMP_MODE_DISABLED ?
>>
>> Or seccomp_run_filters() can see ->filters == NULL and WARN(),
>> smp_load_acquire() only serializes that LOAD with the subsequent memory
>> operations.
>
> Hm, actually, now I'm worried about smp_load_acquire() being too slow
> in run_filters().
>
> The ordering must be:
> - task->seccomp.filter must be valid before
> - task->seccomp.mode is set, which must be valid before
> - TIF_SECCOMP is set
>
> But I don't want to impact secure_computing(). What's the best way to
> make sure this ordering is respected?

Remove the ordering requirement, perhaps?

What if you moved mode into seccomp.filter? Then there would be
little reason to check TIF_SECCOMP from secure_computing; instead, you
could smp_load_acquire (or read_barrier_depends, maybe) seccomp.filter
from secure_computing and pass the result as a parameter to
__secure_computing. Or you could even remove the distinction between
secure_computing and __secure_computing -- it's essentially useless
anyway for split entry-hook approaches like my x86 fastpath prototype.

--Andy

2014-06-25 16:53:57

by Oleg Nesterov

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On 06/25, Kees Cook wrote:
>
> On Wed, Jun 25, 2014 at 7:21 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > But. Doesn't this change add a new security hole?
> >
> > Obviously, we should not allow to install a filter and then (say) exec
> > a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS.
> >
> > But what if "thread->seccomp.filter = caller->seccomp.filter" races with
> > any user of task_no_new_privs() ? Say, suppose this thread has already
> > passed check_unsafe_exec/etc and it is going to exec the suid binary?
>
> Oh, ew. Yeah. It looks like there's a cred lock to be held to combat this?

Yes, cred_guard_mutex looks like an obvious choice... Hmm, but somehow
initially I thought that the fix wouldn't be simple. Not sure why.

Yes, at least this should close the race with suid-exec. And there are no
other users. Except apparmor, and I hope you will check it because I simply
do not know what it does ;)

> I wonder if changes to nnp need to "flushed" during syscall entry
> instead of getting updated externally/asynchronously? That way it
> won't be out of sync with the seccomp mode/filters.
>
> Perhaps secure computing needs to check some (maybe seccomp-only)
> atomic flags and flip on the "real" nnp if found?

Not sure I understand you, could you clarify?

But I was also worried that task_no_new_privs(current) is no longer stable
inside the syscall paths, perhaps this is what you meant? However I do not
see anything bad here... And this has nothing to do with the race above.



Also. Even ignoring no_new_privs, SECCOMP_FILTER_FLAG_TSYNC is not atomic
and we can do nothing about this fact (unless we try to freeze the thread
group somehow), perhaps it makes sense to document this somehow.

I mean, suppose you want to ensure write-to-file is not possible, so you
do seccomp(SECCOMP_FILTER_FLAG_TSYNC, nack_write_to_file_filter). You can't
assume that this has effect right after seccomp() returns, this can obviously
race with a sub-thread which has already entered sys_write().

Once again, I am not arguing, just I think it makes sense to at least mention
the limitations during the discussion.

Oleg.

2014-06-25 16:54:08

by Kees Cook

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Wed, Jun 25, 2014 at 9:10 AM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Jun 25, 2014 at 7:51 AM, Kees Cook <[email protected]> wrote:
>> On Wed, Jun 25, 2014 at 6:51 AM, Oleg Nesterov <[email protected]> wrote:
>>> On 06/24, Kees Cook wrote:
>>>>
>>>> +static inline void seccomp_assign_mode(struct task_struct *task,
>>>> + unsigned long seccomp_mode)
>>>> +{
>>>> + BUG_ON(!spin_is_locked(&task->sighand->siglock));
>>>> +
>>>> + task->seccomp.mode = seccomp_mode;
>>>> + set_tsk_thread_flag(task, TIF_SECCOMP);
>>>> +}
>>>
>>> OK, but unless task == current this can race with secure_computing().
>>> I think this needs smp_mb__before_atomic() and secure_computing() needs
>>> rmb() after test_bit(TIF_SECCOMP).
>>>
>>> Otherwise, can't __secure_computing() hit BUG() if it sees the old
>>> mode == SECCOMP_MODE_DISABLED ?
>>>
>>> Or seccomp_run_filters() can see ->filters == NULL and WARN(),
>>> smp_load_acquire() only serializes that LOAD with the subsequent memory
>>> operations.
>>
>> Hm, actually, now I'm worried about smp_load_acquire() being too slow
>> in run_filters().
>>
>> The ordering must be:
>> - task->seccomp.filter must be valid before
>> - task->seccomp.mode is set, which must be valid before
>> - TIF_SECCOMP is set
>>
>> But I don't want to impact secure_computing(). What's the best way to
>> make sure this ordering is respected?
>
> Remove the ordering requirement, perhaps?
>
> What if you moved mode into seccomp.filter? Then there would be
> little reason to check TIF_SECCOMP from secure_computing; instead, you
> could smp_load_acquire (or read_barrier_depends, maybe) seccomp.filter
> from secure_computing and pass the result as a parameter to
> __secure_computing. Or you could even remove the distinction between
> secure_computing and __secure_computing -- it's essentially useless
> anyway to split entry hook approaches like my x86 fastpath prototype.

TIF_SECCOMP is needed for the syscall entry path. The check in
secure_computing() is just because the "I am being traced" trigger
includes a call to secure_computing, which filters out tracing
reasons.

Your fast path work would clean a lot of that up, as you say. But it
still doesn't change the ordering check here. TIF_SECCOMP indicates
seccomp.mode must be checked, so that ordering will remain, and if
mode == FILTER, seccomp.filter must be valid.

Isn't there a way we can force the assignment ordering in seccomp_assign_mode()?

-Kees

--
Kees Cook
Chrome OS Security

2014-06-25 17:02:53

by Oleg Nesterov

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On 06/25, Kees Cook wrote:
>
> On Wed, Jun 25, 2014 at 6:51 AM, Oleg Nesterov <[email protected]> wrote:
> > On 06/24, Kees Cook wrote:
> >>
> >> +static inline void seccomp_assign_mode(struct task_struct *task,
> >> + unsigned long seccomp_mode)
> >> +{
> >> + BUG_ON(!spin_is_locked(&task->sighand->siglock));
> >> +
> >> + task->seccomp.mode = seccomp_mode;
> >> + set_tsk_thread_flag(task, TIF_SECCOMP);
> >> +}
> >
> > OK, but unless task == current this can race with secure_computing().
> > I think this needs smp_mb__before_atomic() and secure_computing() needs
> > rmb() after test_bit(TIF_SECCOMP).
> >
> > Otherwise, can't __secure_computing() hit BUG() if it sees the old
> > mode == SECCOMP_MODE_DISABLED ?
> >
> > Or seccomp_run_filters() can see ->filters == NULL and WARN(),
> > smp_load_acquire() only serializes that LOAD with the subsequent memory
> > operations.
>
> Hm, actually, now I'm worried about smp_load_acquire() being too slow
> in run_filters().
>
> The ordering must be:
> - task->seccomp.filter must be valid before
> - task->seccomp.mode is set, which must be valid before
> - TIF_SECCOMP is set
>
> But I don't want to impact secure_computing(). What's the best way to
> make sure this ordering is respected?

Cough, confused... can't understand even after I read the email from Andy.

We do not care if __secure_computing() misses the recently added filter,
this can happen anyway, whatever we do.

seccomp.mode is frozen after we set it != SECCOMP_MODE_DISABLED.

So we should only worry if set_tsk_thread_flag(TIF_SECCOMP) actually
changes this bit and makes __secure_computing() possible. If we add
smp_mb__before_atomic() into seccomp_assign_mode() and rmb() at the
start of __secure_computing() everything should be fine?

Oleg.

2014-06-25 17:04:07

by Andy Lutomirski

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Wed, Jun 25, 2014 at 9:54 AM, Kees Cook <[email protected]> wrote:
> On Wed, Jun 25, 2014 at 9:10 AM, Andy Lutomirski <[email protected]> wrote:
>> On Wed, Jun 25, 2014 at 7:51 AM, Kees Cook <[email protected]> wrote:
>>> On Wed, Jun 25, 2014 at 6:51 AM, Oleg Nesterov <[email protected]> wrote:
>>>> On 06/24, Kees Cook wrote:
>>>>>
>>>>> +static inline void seccomp_assign_mode(struct task_struct *task,
>>>>> + unsigned long seccomp_mode)
>>>>> +{
>>>>> + BUG_ON(!spin_is_locked(&task->sighand->siglock));
>>>>> +
>>>>> + task->seccomp.mode = seccomp_mode;
>>>>> + set_tsk_thread_flag(task, TIF_SECCOMP);
>>>>> +}
>>>>
>>>> OK, but unless task == current this can race with secure_computing().
>>>> I think this needs smp_mb__before_atomic() and secure_computing() needs
>>>> rmb() after test_bit(TIF_SECCOMP).
>>>>
>>>> Otherwise, can't __secure_computing() hit BUG() if it sees the old
>>>> mode == SECCOMP_MODE_DISABLED ?
>>>>
>>>> Or seccomp_run_filters() can see ->filters == NULL and WARN(),
>>>> smp_load_acquire() only serializes that LOAD with the subsequent memory
>>>> operations.
>>>
>>> Hm, actually, now I'm worried about smp_load_acquire() being too slow
>>> in run_filters().
>>>
>>> The ordering must be:
>>> - task->seccomp.filter must be valid before
>>> - task->seccomp.mode is set, which must be valid before
>>> - TIF_SECCOMP is set
>>>
>>> But I don't want to impact secure_computing(). What's the best way to
>>> make sure this ordering is respected?
>>
>> Remove the ordering requirement, perhaps?
>>
>> What if you moved mode into seccomp.filter? Then there would be
>> little reason to check TIF_SECCOMP from secure_computing; instead, you
>> could smp_load_acquire (or read_barrier_depends, maybe) seccomp.filter
>> from secure_computing and pass the result as a parameter to
>> __secure_computing. Or you could even remove the distinction between
>> secure_computing and __secure_computing -- it's essentially useless
>> anyway to split entry hook approaches like my x86 fastpath prototype.
>
> The TIF_SECCOMP is needed for the syscall entry path. The check in
> secure_computing() is just because the "I am being traced" trigger
> includes a call to secure_computing, which filters out tracing
> reasons.

Right. I'm suggesting just checking a single indication that seccomp
is on from the process in the seccomp code so that the order doesn't
matter.

IOW, TIF_SECCOMP causes __secure_computing to be invoked, but the race
only seems to matter because of the warning and the BUG. I think that
both can be fixed if you merge mode into filter so that
__secure_computing atomically checks one condition.

>
> Your fast path work would clean a lot of that up, as you say. But it
> still doesn't change the ordering check here. TIF_SECCOMP indicates
> seccomp.mode must be checked, so that ordering will remain, and if
> mode == FILTER, seccomp.filter must be valid.
>
> Isn't there a way we can force the assignment ordering in seccomp_assign_mode()?

Write the filter, then smp_mb (or maybe a weaker barrier is okay),
then set the bit.

--Andy

>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security



--
Andy Lutomirski
AMA Capital Management, LLC

2014-06-25 17:09:04

by Kees Cook

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/25, Kees Cook wrote:
>>
>> On Wed, Jun 25, 2014 at 7:21 AM, Oleg Nesterov <[email protected]> wrote:
>> >
>> > But. Doesn't this change add a new security hole?
>> >
>> > Obviously, we should not allow to install a filter and then (say) exec
>> > a suid binary, that is why we have no_new_privs/LSM_UNSAFE_NO_NEW_PRIVS.
>> >
>> > But what if "thread->seccomp.filter = caller->seccomp.filter" races with
>> > any user of task_no_new_privs() ? Say, suppose this thread has already
>> > passed check_unsafe_exec/etc and it is going to exec the suid binary?
>>
>> Oh, ew. Yeah. It looks like there's a cred lock to be held to combat this?
>
> Yes, cred_guard_mutex looks like an obvious choice... Hmm, but somehow
> initially I thought that the fix won't be simple. Not sure why.
>
> Yes, at least this should close the race with suid-exec. And there are no
> other users. Except apparmor, and I hope you will check it because I simply
> do not know what it does ;)
>
>> I wonder if changes to nnp need to "flushed" during syscall entry
>> instead of getting updated externally/asynchronously? That way it
>> won't be out of sync with the seccomp mode/filters.
>>
>> Perhaps secure computing needs to check some (maybe seccomp-only)
>> atomic flags and flip on the "real" nnp if found?
>
> Not sure I understand you, could you clarify?

Instead of having TSYNC change the nnp bit, it can set a new flag, say:

task->seccomp.flags |= SECCOMP_NEEDS_NNP;

This would be set along with seccomp.mode, seccomp.filter, and
TIF_SECCOMP. Then, during the next secure_computing() call that thread
makes, it would check the flag:

if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
task->nnp = 1;

This means that nnp couldn't change in the middle of a running syscall.

Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
above would actually make things worse, since now we'd have a thread
with seccomp set up, and no nnp. If it was in the middle of exec,
we're still causing a problem.

I think we'd also need a way to either delay the seccomp changes, or
to notice this condition during exec. Bleh.

What actually happens when a multi-threaded process calls exec? I
assume all the other threads are destroyed?

> But I was also worried that task_no_new_privs(current) is no longer stable
> inside the syscall paths, perhaps this is what you meant? However I do not
> see something bad here... And this has nothing to do with the race above.
>
> Also. Even ignoring no_new_privs, SECCOMP_FILTER_FLAG_TSYNC is not atomic
> and we can do nothing with this fact (unless it try to freeze the thread
> group somehow), perhaps it makes sense to document this somehow.
>
> I mean, suppose you want to ensure write-to-file is not possible, so you
> do seccomp(SECCOMP_FILTER_FLAG_TSYNC, nack_write_to_file_filter). You can't
> assume that this has effect right after seccomp() returns, this can obviously
> race with a sub-thread which has already entered sys_write().
>
> Once again, I am not arguing, just I think it makes sense to at least mention
> the limitations during the discussion.

Right -- this is an accepted limitation. I will call it out
specifically in the man-page; that's a good idea.

-Kees

--
Kees Cook
Chrome OS Security

2014-06-25 17:25:57

by Oleg Nesterov

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On 06/25, Kees Cook wrote:
>
> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > Yes, at least this should close the race with suid-exec. And there are no
> > other users. Except apparmor, and I hope you will check it because I simply
> > do not know what it does ;)
> >
> >> I wonder if changes to nnp need to "flushed" during syscall entry
> >> instead of getting updated externally/asynchronously? That way it
> >> won't be out of sync with the seccomp mode/filters.
> >>
> >> Perhaps secure computing needs to check some (maybe seccomp-only)
> >> atomic flags and flip on the "real" nnp if found?
> >
> > Not sure I understand you, could you clarify?
>
> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>
> task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>
> This would be set along with seccomp.mode, seccomp.filter, and
> TIF_SECCOMP. Then, during the next secure_computing() call that thread
> makes, it would check the flag:
>
> if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
> task->nnp = 1;
>
> This means that nnp couldn't change in the middle of a running syscall.

Aha, so you were worried about the same thing. Not sure we need this,
but at least I understand you and...

> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
> above would actually make things worse, since now we'd have a thread
> with seccomp set up, and no nnp. If it was in the middle of exec,
> we're still causing a problem.

Yes ;)

> I think we'd also need a way to either delay the seccomp changes, or
> to notice this condition during exec. Bleh.

Hmm. confused again,

> What actually happens with a multi-threaded process calls exec? I
> assume all the other threads are destroyed?

Yes. But this is the point-of-no-return, de_thread() is called after the execing
thread has already passed (say) check_unsafe_exec().

However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
and drops it in install_exec_creds(), so it should solve the problem?

Oleg.

2014-06-25 17:34:27

by Oleg Nesterov

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On 06/25, Andy Lutomirski wrote:
>
> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
> then set the bit.

Yes, exactly, this is what I meant. Plus rmb() in __secure_computing().

But I still can't understand the rest of your discussion about the
ordering we need ;)

Oleg.

2014-06-25 17:38:41

by Andy Lutomirski

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/25, Andy Lutomirski wrote:
>>
>> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
>> then set the bit.
>
> Yes, exactly, this is what I meant. Plas rmb() in __secure_computing().
>
> But I still can't understand the rest of your discussion about the
> ordering we need ;)

Let me try again from scratch.

Currently there are three relevant variables: TIF_SECCOMP,
seccomp.mode, and seccomp.filter. __secure_computing needs
seccomp.mode and seccomp.filter to be in sync, and it wants (but
doesn't really need) TIF_SECCOMP to be in sync as well.

My suggestion is to rearrange it a bit. Move mode into seccomp.filter
(so that filter == NULL implies no seccomp) and don't check
TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
atomic except for the fact that the seccomp hooks won't be called if
filter != NULL but !TIF_SECCOMP. This removes all ordering
requirements.

Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
In that case, filter needs to be set before TIF_SECCOMP is set, but
that's straightforward.

--Andy

2014-06-25 17:41:14

by Andy Lutomirski

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/25, Kees Cook wrote:
>>
>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <[email protected]> wrote:
>> >
>> > Yes, at least this should close the race with suid-exec. And there are no
>> > other users. Except apparmor, and I hope you will check it because I simply
>> > do not know what it does ;)
>> >
>> >> I wonder if changes to nnp need to "flushed" during syscall entry
>> >> instead of getting updated externally/asynchronously? That way it
>> >> won't be out of sync with the seccomp mode/filters.
>> >>
>> >> Perhaps secure computing needs to check some (maybe seccomp-only)
>> >> atomic flags and flip on the "real" nnp if found?
>> >
>> > Not sure I understand you, could you clarify?
>>
>> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>>
>> task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>>
>> This would be set along with seccomp.mode, seccomp.filter, and
>> TIF_SECCOMP. Then, during the next secure_computing() call that thread
>> makes, it would check the flag:
>>
>> if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
>> task->nnp = 1;
>>
>> This means that nnp couldn't change in the middle of a running syscall.
>
> Aha, so you were worried about the same thing. Not sure we need this,
> but at least I understand you and...
>
>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
>> above would actually make things worse, since now we'd have a thread
>> with seccomp set up, and no nnp. If it was in the middle of exec,
>> we're still causing a problem.
>
> Yes ;)
>
>> I think we'd also need a way to either delay the seccomp changes, or
>> to notice this condition during exec. Bleh.
>
> Hmm. confused again,
>
>> What actually happens with a multi-threaded process calls exec? I
>> assume all the other threads are destroyed?
>
> Yes. But this is the point-of-no-return, de_thread() is called after the execing
> thared has already passed (say) check_unsafe_exec().
>
> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
> and drops it in install_exec_creds(), so it should solve the problem?

If you rely on this, then please fix this comment in fs/exec.c:

/*
* determine how safe it is to execute the proposed program
* - the caller must hold ->cred_guard_mutex to protect against
* PTRACE_ATTACH
*/
static void check_unsafe_exec(struct linux_binprm *bprm)

It sounds like cred_guard_mutex is there for exactly this reason :)

--Andy

2014-06-25 17:53:25

by Oleg Nesterov

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On 06/25, Andy Lutomirski wrote:
>
> On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
> > On 06/25, Andy Lutomirski wrote:
> >>
> >> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
> >> then set the bit.
> >
> > Yes, exactly, this is what I meant. Plas rmb() in __secure_computing().
> >
> > But I still can't understand the rest of your discussion about the
> > ordering we need ;)
>
> Let me try again from scratch.
>
> Currently there are three relevant variables: TIF_SECCOMP,
> seccomp.mode, and seccomp.filter. __secure_computing needs
> seccomp.mode and seccomp.filter to be in sync, and it wants (but
> doesn't really need) TIF_SECCOMP to be in sync as well.
>
> My suggestion is to rearrange it a bit. Move mode into seccomp.filter
> (so that filter == NULL implies no seccomp) and don't check
> TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
> atomic except for the fact that the seccomp hooks won't be called if
> filter != NULL but !TIF_SECCOMP. This removes all ordering
> requirements.

Ah, got it, thanks. Perhaps I missed something, but to me this looks like
an unnecessary complication at first glance.

We already have TIF_SECCOMP, we need it anyway, and we should only care
about the case when this bit is actually set, so that we can race with
the 1st call of __secure_computing().

Otherwise we are fine: we can miss the new filter anyway, and ->mode can't
be changed once it is already nonzero.

> Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
> In that case, filter needs to be set before TIF_SECCOMP is set, but
> that's straightforward.

Yep. And this is how seccomp_assign_mode() already works? It is called
after we change the ->filter chain, and it changes ->mode before
set(TIF_SECCOMP); it just lacks a barrier.

Oleg.

by Michael Kerrisk

Subject: Re: [PATCH v8 1/1] man-pages: seccomp.2: document syscall

On Wed, Jun 25, 2014 at 5:10 PM, Kees Cook <[email protected]> wrote:
> On Wed, Jun 25, 2014 at 6:04 AM, One Thousand Gnomes
> <[email protected]> wrote:
>> On Tue, 24 Jun 2014 13:56:15 -0700
>> Kees Cook <[email protected]> wrote:
>>
>>> Combines documentation from prctl, in-kernel seccomp_filter.txt and
>>> dropper.c, along with details specific to the new syscall.
>>
>> What is the license on the example ? Probably you want to propogate the
>> minimal form of the text in seccomp/dropper into the document example to
>> avoid confusion ?
>
> What is the license of the other code examples in man-pages?

Typically, just the same as the rest of the page text. Perhaps that
should be rethought for future examples. I haven't thought about it at
length, but, at first glance, I'm not against having separate licenses
for the page text and the code sample.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-06-25 17:57:31

by Kees Cook

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/25, Kees Cook wrote:
>>
>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <[email protected]> wrote:
>> >
>> > Yes, at least this should close the race with suid-exec. And there are no
>> > other users. Except apparmor, and I hope you will check it because I simply
>> > do not know what it does ;)
>> >
>> >> I wonder if changes to nnp need to "flushed" during syscall entry
>> >> instead of getting updated externally/asynchronously? That way it
>> >> won't be out of sync with the seccomp mode/filters.
>> >>
>> >> Perhaps secure computing needs to check some (maybe seccomp-only)
>> >> atomic flags and flip on the "real" nnp if found?
>> >
>> > Not sure I understand you, could you clarify?
>>
>> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>>
>> task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>>
>> This would be set along with seccomp.mode, seccomp.filter, and
>> TIF_SECCOMP. Then, during the next secure_computing() call that thread
>> makes, it would check the flag:
>>
>> if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
>> task->nnp = 1;
>>
>> This means that nnp couldn't change in the middle of a running syscall.
>
> Aha, so you were worried about the same thing. Not sure we need this,
> but at least I understand you and...
>
>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
>> above would actually make things worse, since now we'd have a thread
>> with seccomp set up, and no nnp. If it was in the middle of exec,
>> we're still causing a problem.
>
> Yes ;)
>
>> I think we'd also need a way to either delay the seccomp changes, or
>> to notice this condition during exec. Bleh.
>
> Hmm. confused again,

I mean to suggest that the tsync changes would be stored in each
thread, somewhere other than the real seccomp struct, but with
TIF_SECCOMP set. When entering secure_computing(), current would check
for the "changes to sync", and apply them, then start the syscall. In
this way, we can never race a syscall (like exec).

>> What actually happens when a multi-threaded process calls exec? I
>> assume all the other threads are destroyed?
>
> Yes. But this is the point-of-no-return, de_thread() is called after the execing
> thread has already passed (say) check_unsafe_exec().
>
> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
> and drops it in install_exec_creds(), so it should solve the problem?

I can't tell yet. I'm still trying to understand the order of
operations here. It looks like de_thread() takes the sighand lock.
do_execve_common does:

prepare_bprm_creds (takes cred_guard_mutex)
check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
prepare_binprm (handles suid escalation, checks nnp separately)
security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
exec_binprm
load_elf_binary
flush_old_exec
de_thread (takes and releases sighand->lock)
install_exec_creds (releases cred_guard_mutex)

I don't see a way to use cred_guard_mutex during tsync (which holds
sighand->lock) without dead-locking. What were you considering here?

-Kees

--
Kees Cook
Chrome OS Security

2014-06-25 18:00:55

by Kees Cook

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Wed, Jun 25, 2014 at 10:51 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/25, Andy Lutomirski wrote:
>>
>> On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
>> > On 06/25, Andy Lutomirski wrote:
>> >>
>> >> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
>> >> then set the bit.
>> >
>> > Yes, exactly, this is what I meant. Plus rmb() in __secure_computing().
>> >
>> > But I still can't understand the rest of your discussion about the
>> > ordering we need ;)
>>
>> Let me try again from scratch.
>>
>> Currently there are three relevant variables: TIF_SECCOMP,
>> seccomp.mode, and seccomp.filter. __secure_computing needs
>> seccomp.mode and seccomp.filter to be in sync, and it wants (but
>> doesn't really need) TIF_SECCOMP to be in sync as well.
>>
>> My suggestion is to rearrange it a bit. Move mode into seccomp.filter
>> (so that filter == NULL implies no seccomp) and don't check

This would require that we reimplement mode 1 seccomp via mode 2
filters. Which isn't too hard, but may add complexity.

>> TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
>> atomic except for the fact that the seccomp hooks won't be called if
>> filter != NULL but !TIF_SECCOMP. This removes all ordering
>> requirements.
>
> Ah, got it, thanks. Perhaps I missed something, but to me this looks like
> unnecessary complication at first glance.
>
> We already have TIF_SECCOMP, we need it anyway, and we should only care
> about the case when this bit is actually set, so that we can race with
> the 1st call of __secure_computing().
>
> Otherwise we are fine: we can miss the new filter anyway, and ->mode can't
> be changed, it is already nonzero.
>
>> Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
>> In that case, filter needs to be set before TIF_SECCOMP is set, but
>> that's straightforward.
>
> Yep. And this is how seccomp_assign_mode() already works? It is called
> after we change the ->filter chain; it changes ->mode before set(TIF_SECCOMP),
> it just lacks a barrier.

Right, I think the best solution is to add the barrier. I was
concerned that adding the read barrier in secure_computing would have
a performance impact, though.


--
Kees Cook
Chrome OS Security

2014-06-25 18:08:17

by Andy Lutomirski

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Wed, Jun 25, 2014 at 11:00 AM, Kees Cook <[email protected]> wrote:
> On Wed, Jun 25, 2014 at 10:51 AM, Oleg Nesterov <[email protected]> wrote:
>> On 06/25, Andy Lutomirski wrote:
>>>
>>> On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
>>> > On 06/25, Andy Lutomirski wrote:
>>> >>
>>> >> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
>>> >> then set the bit.
>>> >
>>> > Yes, exactly, this is what I meant. Plus rmb() in __secure_computing().
>>> >
>>> > But I still can't understand the rest of your discussion about the
>>> > ordering we need ;)
>>>
>>> Let me try again from scratch.
>>>
>>> Currently there are three relevant variables: TIF_SECCOMP,
>>> seccomp.mode, and seccomp.filter. __secure_computing needs
>>> seccomp.mode and seccomp.filter to be in sync, and it wants (but
>>> doesn't really need) TIF_SECCOMP to be in sync as well.
>>>
>>> My suggestion is to rearrange it a bit. Move mode into seccomp.filter
>>> (so that filter == NULL implies no seccomp) and don't check
>
> This would require that we reimplement mode 1 seccomp via mode 2
> filters. Which isn't too hard, but may add complexity.
>
>>> TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
>>> atomic except for the fact that the seccomp hooks won't be called if
>>> filter != NULL but !TIF_SECCOMP. This removes all ordering
>>> requirements.
>>
>> Ah, got it, thanks. Perhaps I missed something, but to me this looks like
>> unnecessary complication at first glance.
>>
>> We already have TIF_SECCOMP, we need it anyway, and we should only care
>> about the case when this bit is actually set, so that we can race with
>> the 1st call of __secure_computing().
>>
>> Otherwise we are fine: we can miss the new filter anyway, and ->mode can't
>> be changed, it is already nonzero.
>>
>>> Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
>>> In that case, filter needs to be set before TIF_SECCOMP is set, but
>>> that's straightforward.
>>
>> Yep. And this is how seccomp_assign_mode() already works? It is called
>> after we change the ->filter chain; it changes ->mode before set(TIF_SECCOMP),
>> it just lacks a barrier.
>
> Right, I think the best solution is to add the barrier. I was
> concerned that adding the read barrier in secure_computing would have
> a performance impact, though.
>

I can't speak for ARM, but I think that all of the read barriers are
essentially free on x86. (smp_mb is a very different story, but that
shouldn't be needed here.)

--Andy

2014-06-25 18:08:47

by Oleg Nesterov

Subject: Re: [PATCH v8 3/9] seccomp: introduce writer locking

On 06/24, Kees Cook wrote:
>
> +static void copy_seccomp(struct task_struct *p)
> +{
> +#ifdef CONFIG_SECCOMP
> + /*
> + * Must be called with sighand->lock held, which is common to
> + * all threads in the group. Regardless, nothing special is
> + * needed for the child since it is not yet in the tasklist.
> + */
> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
> +
> + get_seccomp_filter(current);
> + p->seccomp = current->seccomp;
> +
> + if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
> + set_tsk_thread_flag(p, TIF_SECCOMP);
> +#endif
> +}

Wait. But what about no_new_privs? We should copy it as well...

Perhaps this helper should be updated a bit and moved into seccomp.c so
that seccomp_sync_threads() could use it too.

Oleg.

2014-06-25 18:10:06

by Andy Lutomirski

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Wed, Jun 25, 2014 at 10:57 AM, Kees Cook <[email protected]> wrote:
> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <[email protected]> wrote:
>> On 06/25, Kees Cook wrote:
>>>
>>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <[email protected]> wrote:
>>> >
>>> > Yes, at least this should close the race with suid-exec. And there are no
>>> > other users. Except apparmor, and I hope you will check it because I simply
>>> > do not know what it does ;)
>>> >
>>> >> I wonder if changes to nnp need to "flushed" during syscall entry
>>> >> instead of getting updated externally/asynchronously? That way it
>>> >> won't be out of sync with the seccomp mode/filters.
>>> >>
>>> >> Perhaps secure computing needs to check some (maybe seccomp-only)
>>> >> atomic flags and flip on the "real" nnp if found?
>>> >
>>> > Not sure I understand you, could you clarify?
>>>
>>> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>>>
>>> task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>>>
>>> This would be set along with seccomp.mode, seccomp.filter, and
>>> TIF_SECCOMP. Then, during the next secure_computing() call that thread
>>> makes, it would check the flag:
>>>
>>> if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
>>> task->nnp = 1;
>>>
>>> This means that nnp couldn't change in the middle of a running syscall.
>>
>> Aha, so you were worried about the same thing. Not sure we need this,
>> but at least I understand you and...
>>
>>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
>>> above would actually make things worse, since now we'd have a thread
>>> with seccomp set up, and no nnp. If it was in the middle of exec,
>>> we're still causing a problem.
>>
>> Yes ;)
>>
>>> I think we'd also need a way to either delay the seccomp changes, or
>>> to notice this condition during exec. Bleh.
>>
>> Hmm. confused again,
>
> I mean to suggest that the tsync changes would be stored in each
> thread, but somewhere other than the true seccomp struct, but with
> TIF_SECCOMP set. When entering secure_computing(), current would check
> for the "changes to sync", and apply them, then start the syscall. In
> this way, we can never race a syscall (like exec).

I'm not sure that helps. If you set a pending filter part-way through
exec, and exec copies that pending filter but doesn't notice NNP, then
there's an exploitable race.

>
>>> What actually happens when a multi-threaded process calls exec? I
>>> assume all the other threads are destroyed?
>>
>> Yes. But this is the point-of-no-return, de_thread() is called after the execing
>> thread has already passed (say) check_unsafe_exec().
>>
>> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
>> and drops it in install_exec_creds(), so it should solve the problem?
>
> I can't tell yet. I'm still trying to understand the order of
> operations here. It looks like de_thread() takes the sighand lock.
> do_execve_common does:
>
> prepare_bprm_creds (takes cred_guard_mutex)
> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
> prepare_binprm (handles suid escalation, checks nnp separately)
> security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
> exec_binprm
> load_elf_binary
> flush_old_exec
> de_thread (takes and releases sighand->lock)
> install_exec_creds (releases cred_guard_mutex)
>
> I don't see a way to use cred_guard_mutex during tsync (which holds
> sighand->lock) without dead-locking. What were you considering here?
>

Grab cred_guard_mutex and then sighand->lock, perhaps?

> -Kees
>
> --
> Kees Cook
> Chrome OS Security



--
Andy Lutomirski
AMA Capital Management, LLC

2014-06-25 18:21:56

by Oleg Nesterov

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On 06/25, Kees Cook wrote:
>
> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
> > and drops it in install_exec_creds(), so it should solve the problem?
>
> I can't tell yet. I'm still trying to understand the order of
> operations here. It looks like de_thread() takes the sighand lock.
> do_execve_common does:
>
> prepare_bprm_creds (takes cred_guard_mutex)
> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
> prepare_binprm (handles suid escalation, checks nnp separately)
> security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
> exec_binprm
> load_elf_binary
> flush_old_exec
> de_thread (takes and releases sighand->lock)
> install_exec_creds (releases cred_guard_mutex)

Yes, and note that when cred_guard_mutex is dropped, all other threads
are already killed.

> I don't see a way to use cred_guard_mutex during tsync (which holds
> sighand->lock) without dead-locking. What were you considering here?

Just take/drop current->signal->cred_guard_mutex along with ->siglock
in seccomp_set_mode_filter()? Either unconditionally or depending on
SECCOMP_FILTER_FLAG_TSYNC.

Oleg.

2014-06-25 18:25:31

by Kees Cook

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Wed, Jun 25, 2014 at 11:09 AM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Jun 25, 2014 at 10:57 AM, Kees Cook <[email protected]> wrote:
>> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <[email protected]> wrote:
>>> On 06/25, Kees Cook wrote:
>>>>
>>>> On Wed, Jun 25, 2014 at 9:52 AM, Oleg Nesterov <[email protected]> wrote:
>>>> >
>>>> > Yes, at least this should close the race with suid-exec. And there are no
>>>> > other users. Except apparmor, and I hope you will check it because I simply
>>>> > do not know what it does ;)
>>>> >
>>>> >> I wonder if changes to nnp need to "flushed" during syscall entry
>>>> >> instead of getting updated externally/asynchronously? That way it
>>>> >> won't be out of sync with the seccomp mode/filters.
>>>> >>
>>>> >> Perhaps secure computing needs to check some (maybe seccomp-only)
>>>> >> atomic flags and flip on the "real" nnp if found?
>>>> >
>>>> > Not sure I understand you, could you clarify?
>>>>
>>>> Instead of having TSYNC change the nnp bit, it can set a new flag, say:
>>>>
>>>> task->seccomp.flags |= SECCOMP_NEEDS_NNP;
>>>>
>>>> This would be set along with seccomp.mode, seccomp.filter, and
>>>> TIF_SECCOMP. Then, during the next secure_computing() call that thread
>>>> makes, it would check the flag:
>>>>
>>>> if (task->seccomp.flags & SECCOMP_NEEDS_NNP)
>>>> task->nnp = 1;
>>>>
>>>> This means that nnp couldn't change in the middle of a running syscall.
>>>
>>> Aha, so you were worried about the same thing. Not sure we need this,
>>> but at least I understand you and...
>>>
>>>> Hmmm. Perhaps this doesn't solve anything, though? Perhaps my proposal
>>>> above would actually make things worse, since now we'd have a thread
>>>> with seccomp set up, and no nnp. If it was in the middle of exec,
>>>> we're still causing a problem.
>>>
>>> Yes ;)
>>>
>>>> I think we'd also need a way to either delay the seccomp changes, or
>>>> to notice this condition during exec. Bleh.
>>>
>>> Hmm. confused again,
>>
>> I mean to suggest that the tsync changes would be stored in each
>> thread, but somewhere other than the true seccomp struct, but with
>> TIF_SECCOMP set. When entering secure_computing(), current would check
>> for the "changes to sync", and apply them, then start the syscall. In
>> this way, we can never race a syscall (like exec).
>
> I'm not sure that helps. If you set a pending filter part-way through
> exec, and exec copies that pending filter but doesn't notice NNP, then
> there's an exploitable race.
>
>>
>>>> What actually happens when a multi-threaded process calls exec? I
>>>> assume all the other threads are destroyed?
>>>
>>> Yes. But this is the point-of-no-return, de_thread() is called after the execing
>>> thread has already passed (say) check_unsafe_exec().
>>>
>>> However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
>>> and drops it in install_exec_creds(), so it should solve the problem?
>>
>> I can't tell yet. I'm still trying to understand the order of
>> operations here. It looks like de_thread() takes the sighand lock.
>> do_execve_common does:
>>
>> prepare_bprm_creds (takes cred_guard_mutex)
>> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
>> prepare_binprm (handles suid escalation, checks nnp separately)
>> security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
>> exec_binprm
>> load_elf_binary
>> flush_old_exec
>> de_thread (takes and releases sighand->lock)
>> install_exec_creds (releases cred_guard_mutex)
>>
>> I don't see a way to use cred_guard_mutex during tsync (which holds
>> sighand->lock) without dead-locking. What were you considering here?
>
> Grab cred_guard_mutex and then sighand->lock, perhaps?

Ah, yes, task->signal is like sighand: shared across all threads.

--
Kees Cook
Chrome OS Security

2014-06-25 18:30:50

by Oleg Nesterov

Subject: Re: [PATCH v8 3/9] seccomp: introduce writer locking

On 06/25, Oleg Nesterov wrote:
>
> On 06/24, Kees Cook wrote:
> >
> > +static void copy_seccomp(struct task_struct *p)
> > +{
> > +#ifdef CONFIG_SECCOMP
> > + /*
> > + * Must be called with sighand->lock held, which is common to
> > + * all threads in the group. Regardless, nothing special is
> > + * needed for the child since it is not yet in the tasklist.
> > + */
> > + BUG_ON(!spin_is_locked(&current->sighand->siglock));
> > +
> > + get_seccomp_filter(current);
> > + p->seccomp = current->seccomp;
> > +
> > + if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
> > + set_tsk_thread_flag(p, TIF_SECCOMP);
> > +#endif
> > +}
>
> Wait. But what about no_new_privs? We should copy it as well...
>
> Perhaps this helper should be updated a bit and moved into seccomp.c so
> that seccomp_sync_threads() could use it too.

This way we can also unexport get_seccomp_filter().

Oleg.

2014-06-25 18:31:52

by Kees Cook

Subject: Re: [PATCH v8 9/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Wed, Jun 25, 2014 at 11:20 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/25, Kees Cook wrote:
>>
>> On Wed, Jun 25, 2014 at 10:24 AM, Oleg Nesterov <[email protected]> wrote:
>> >
>> > However, do_execve() takes cred_guard_mutex at the start in prepare_bprm_creds()
>> > and drops it in install_exec_creds(), so it should solve the problem?
>>
>> I can't tell yet. I'm still trying to understand the order of
>> operations here. It looks like de_thread() takes the sighand lock.
>> do_execve_common does:
>>
>> prepare_bprm_creds (takes cred_guard_mutex)
>> check_unsafe_exec (checks nnp to set LSM_UNSAFE_NO_NEW_PRIVS)
>> prepare_binprm (handles suid escalation, checks nnp separately)
>> security_bprm_set_creds (checks LSM_UNSAFE_NO_NEW_PRIVS)
>> exec_binprm
>> load_elf_binary
>> flush_old_exec
>> de_thread (takes and releases sighand->lock)
>> install_exec_creds (releases cred_guard_mutex)
>
> Yes, and note that when cred_guard_mutex is dropped, all other threads
> are already killed.
>
>> I don't see a way to use cred_guard_mutex during tsync (which holds
>> sighand->lock) without dead-locking. What were you considering here?
>
> Just take/drop current->signal->cred_guard_mutex along with ->siglock
> in seccomp_set_mode_filter()? Either unconditionally or depending on
> SECCOMP_FILTER_FLAG_TSYNC.

Yeah, this looks good. *whew* Testing it now, so far so good.

Thanks!

-Kees

--
Kees Cook
Chrome OS Security

2014-06-27 17:27:20

by Kees Cook

Subject: Re: [PATCH v8 3/9] seccomp: introduce writer locking

On Wed, Jun 25, 2014 at 11:07 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/24, Kees Cook wrote:
>>
>> +static void copy_seccomp(struct task_struct *p)
>> +{
>> +#ifdef CONFIG_SECCOMP
>> + /*
>> + * Must be called with sighand->lock held, which is common to
>> + * all threads in the group. Regardless, nothing special is
>> + * needed for the child since it is not yet in the tasklist.
>> + */
>> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
>> +
>> + get_seccomp_filter(current);
>> + p->seccomp = current->seccomp;
>> +
>> + if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
>> + set_tsk_thread_flag(p, TIF_SECCOMP);
>> +#endif
>> +}
>
> Wait. But what about no_new_privs? We should copy it as well...
>
> Perhaps this helper should be updated a bit and moved into seccomp.c so
> that seccomp_sync_threads() could use it too.

Ah! Yes. I had been thinking it had been copied during the task_struct
duplication, but that would have been before holding sighand->lock, so
it needs explicit recopying. Thanks!

-Kees

--
Kees Cook
Chrome OS Security

2014-06-27 18:33:52

by Kees Cook

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Wed, Jun 25, 2014 at 11:07 AM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Jun 25, 2014 at 11:00 AM, Kees Cook <[email protected]> wrote:
>> On Wed, Jun 25, 2014 at 10:51 AM, Oleg Nesterov <[email protected]> wrote:
>>> On 06/25, Andy Lutomirski wrote:
>>>>
>>>> On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
>>>> > On 06/25, Andy Lutomirski wrote:
>>>> >>
>>>> >> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
>>>> >> then set the bit.
>>>> >
>>>> > Yes, exactly, this is what I meant. Plus rmb() in __secure_computing().
>>>> >
>>>> > But I still can't understand the rest of your discussion about the
>>>> > ordering we need ;)
>>>>
>>>> Let me try again from scratch.
>>>>
>>>> Currently there are three relevant variables: TIF_SECCOMP,
>>>> seccomp.mode, and seccomp.filter. __secure_computing needs
>>>> seccomp.mode and seccomp.filter to be in sync, and it wants (but
>>>> doesn't really need) TIF_SECCOMP to be in sync as well.
>>>>
>>>> My suggestion is to rearrange it a bit. Move mode into seccomp.filter
>>>> (so that filter == NULL implies no seccomp) and don't check
>>
>> This would require that we reimplement mode 1 seccomp via mode 2
>> filters. Which isn't too hard, but may add complexity.
>>
>>>> TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
>>>> atomic except for the fact that the seccomp hooks won't be called if
>>>> filter != NULL but !TIF_SECCOMP. This removes all ordering
>>>> requirements.
>>>
>>> Ah, got it, thanks. Perhaps I missed something, but to me this looks like
>>> unnecessary complication at first glance.
>>>
>>> We already have TIF_SECCOMP, we need it anyway, and we should only care
>>> about the case when this bit is actually set, so that we can race with
>>> the 1st call of __secure_computing().
>>>
>>> Otherwise we are fine: we can miss the new filter anyway, and ->mode can't
>>> be changed, it is already nonzero.
>>>
>>>> Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
>>>> In that case, filter needs to be set before TIF_SECCOMP is set, but
>>>> that's straightforward.
>>>
>>> Yep. And this is how seccomp_assign_mode() already works? It is called
>>> after we change the ->filter chain; it changes ->mode before set(TIF_SECCOMP),
>>> it just lacks a barrier.
>>
>> Right, I think the best solution is to add the barrier. I was
>> concerned that adding the read barrier in secure_computing would have
>> a performance impact, though.
>>
>
> I can't speak for ARM, but I think that all of the read barriers are
> essentially free on x86. (smp_mb is a very different story, but that
> shouldn't be needed here.)

It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html

If I skip the rmb in the secure_computing call before checking mode,
it sounds like I run the risk of racing an out-of-order TIF_SECCOMP vs
mode and filter. This seems unlikely to me, given an addition of the
smp_mb__before_atomic() during the seccomp_assign_mode()? I guess I
don't have a sense of how aggressively ARM might do data caching in
this area. Could the other thread actually see TIF_SECCOMP get set but
still have an out of date copy of seccomp.mode?

I really want to avoid adding anything to the secure_computing()
execution path. :(

-Kees

--
Kees Cook
Chrome OS Security

2014-06-27 18:39:50

by Andy Lutomirski

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Fri, Jun 27, 2014 at 11:33 AM, Kees Cook <[email protected]> wrote:
> On Wed, Jun 25, 2014 at 11:07 AM, Andy Lutomirski <[email protected]> wrote:
>> On Wed, Jun 25, 2014 at 11:00 AM, Kees Cook <[email protected]> wrote:
>>> On Wed, Jun 25, 2014 at 10:51 AM, Oleg Nesterov <[email protected]> wrote:
>>>> On 06/25, Andy Lutomirski wrote:
>>>>>
>>>>> On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
>>>>> > On 06/25, Andy Lutomirski wrote:
>>>>> >>
>>>>> >> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
>>>>> >> then set the bit.
>>>>> >
>>>>> > Yes, exactly, this is what I meant. Plus rmb() in __secure_computing().
>>>>> >
>>>>> > But I still can't understand the rest of your discussion about the
>>>>> > ordering we need ;)
>>>>>
>>>>> Let me try again from scratch.
>>>>>
>>>>> Currently there are three relevant variables: TIF_SECCOMP,
>>>>> seccomp.mode, and seccomp.filter. __secure_computing needs
>>>>> seccomp.mode and seccomp.filter to be in sync, and it wants (but
>>>>> doesn't really need) TIF_SECCOMP to be in sync as well.
>>>>>
>>>>> My suggestion is to rearrange it a bit. Move mode into seccomp.filter
>>>>> (so that filter == NULL implies no seccomp) and don't check
>>>
>>> This would require that we reimplement mode 1 seccomp via mode 2
>>> filters. Which isn't too hard, but may add complexity.
>>>
>>>>> TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
>>>>> atomic except for the fact that the seccomp hooks won't be called if
>>>>> filter != NULL but !TIF_SECCOMP. This removes all ordering
>>>>> requirements.
>>>>
>>>> Ah, got it, thanks. Perhaps I missed something, but to me this looks like
>>>> unnecessary complication at first glance.
>>>>
>>>> We already have TIF_SECCOMP, we need it anyway, and we should only care
>>>> about the case when this bit is actually set, so that we can race with
>>>> the 1st call of __secure_computing().
>>>>
>>>> Otherwise we are fine: we can miss the new filter anyway, and ->mode can't
>>>> be changed, it is already nonzero.
>>>>
>>>>> Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
>>>>> In that case, filter needs to be set before TIF_SECCOMP is set, but
>>>>> that's straightforward.
>>>>
>>>> Yep. And this is how seccomp_assign_mode() already works? It is called
>>>> after we change the ->filter chain; it changes ->mode before set(TIF_SECCOMP),
>>>> it just lacks a barrier.
>>>
>>> Right, I think the best solution is to add the barrier. I was
>>> concerned that adding the read barrier in secure_computing would have
>>> a performance impact, though.
>>>
>>
>> I can't speak for ARM, but I think that all of the read barriers are
>> essentially free on x86. (smp_mb is a very different story, but that
>> shouldn't be needed here.)
>
> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
>
> If I skip the rmb in the secure_computing call before checking mode,
> it sounds like I run the risk of racing an out-of-order TIF_SECCOMP vs
> mode and filter. This seems unlikely to me, given an addition of the
> smp_mb__before_atomic() during the seccomp_assign_mode()? I guess I
> don't have a sense of how aggressively ARM might do data caching in
> this area. Could the other thread actually see TIF_SECCOMP get set but
> still have an out of date copy of seccomp.mode?
>
> I really want to avoid adding anything to the secure_computing()
> execution path. :(

Hence my suggestion to make the ordering not matter. No ordering
requirement, no barriers.

--Andy

>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security



--
Andy Lutomirski
AMA Capital Management, LLC

2014-06-27 18:52:35

by Kees Cook

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Fri, Jun 27, 2014 at 11:39 AM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 11:33 AM, Kees Cook <[email protected]> wrote:
>> On Wed, Jun 25, 2014 at 11:07 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Wed, Jun 25, 2014 at 11:00 AM, Kees Cook <[email protected]> wrote:
>>>> On Wed, Jun 25, 2014 at 10:51 AM, Oleg Nesterov <[email protected]> wrote:
>>>>> On 06/25, Andy Lutomirski wrote:
>>>>>>
>>>>>> On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
>>>>>> > On 06/25, Andy Lutomirski wrote:
>>>>>> >>
>>>>>> >> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
>>>>>> >> then set the bit.
>>>>>> >
>>>>>> > Yes, exactly, this is what I meant. Plus rmb() in __secure_computing().
>>>>>> >
>>>>>> > But I still can't understand the rest of your discussion about the
>>>>>> > ordering we need ;)
>>>>>>
>>>>>> Let me try again from scratch.
>>>>>>
>>>>>> Currently there are three relevant variables: TIF_SECCOMP,
>>>>>> seccomp.mode, and seccomp.filter. __secure_computing needs
>>>>>> seccomp.mode and seccomp.filter to be in sync, and it wants (but
>>>>>> doesn't really need) TIF_SECCOMP to be in sync as well.
>>>>>>
>>>>>> My suggestion is to rearrange it a bit. Move mode into seccomp.filter
>>>>>> (so that filter == NULL implies no seccomp) and don't check
>>>>
>>>> This would require that we reimplement mode 1 seccomp via mode 2
>>>> filters. Which isn't too hard, but may add complexity.
>>>>
>>>>>> TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
>>>>>> atomic except for the fact that the seccomp hooks won't be called if
>>>>>> filter != NULL but !TIF_SECCOMP. This removes all ordering
>>>>>> requirements.
>>>>>
>>>>> Ah, got it, thanks. Perhaps I missed something, but to me this looks like
>>>>> unnecessary complication at first glance.
>>>>>
>>>>> We already have TIF_SECCOMP, we need it anyway, and we should only care
>>>>> about the case when this bit is actually set, so that we can race with
>>>>> the 1st call of __secure_computing().
>>>>>
>>>>> Otherwise we are fine: we can miss the new filter anyway, and ->mode can't
>>>>> be changed, it is already nonzero.
>>>>>
>>>>>> Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
>>>>>> In that case, filter needs to be set before TIF_SECCOMP is set, but
>>>>>> that's straightforward.
>>>>>
>>>>> Yep. And this is how seccomp_assign_mode() already works? It is called
>>>>> after we change the ->filter chain; it changes ->mode before set(TIF_SECCOMP),
>>>>> it just lacks a barrier.
>>>>
>>>> Right, I think the best solution is to add the barrier. I was
>>>> concerned that adding the read barrier in secure_computing would have
>>>> a performance impact, though.
>>>>
>>>
>>> I can't speak for ARM, but I think that all of the read barriers are
>>> essentially free on x86. (smp_mb is a very different story, but that
>>> shouldn't be needed here.)
>>
>> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
>>
>> If I skip the rmb in the secure_computing call before checking mode,
>> it sounds like I run the risk of racing an out-of-order TIF_SECCOMP vs
>> mode and filter. This seems unlikely to me, given an addition of the
>> smp_mb__before_atomic() during the seccomp_assign_mode()? I guess I
>> don't have a sense of how aggressively ARM might do data caching in
>> this area. Could the other thread actually see TIF_SECCOMP get set but
>> still have an out of date copy of seccomp.mode?
>>
>> I really want to avoid adding anything to the secure_computing()
>> execution path. :(
>
> Hence my suggestion to make the ordering not matter. No ordering
> requirement, no barriers.

I may be misunderstanding something, but I think there's still an
ordering problem. We'll have TIF_SECCOMP already, so if we enter
secure_computing with a NULL filter, we'll kill the process.

Merging .mode and .filter would remove one of the race failure paths:
having TIF_SECCOMP and not having a mode set (leading to BUG). With
the merge, we could still race and land in the same place as have
TIF_SECCOMP and mode==2, but filter==NULL, leading to WARN and kill.

I guess the question is how large is the race risk on ARM? Is it
possible to have TIF_SECCOMP that far out of sync for the thread?

-Kees

--
Kees Cook
Chrome OS Security

2014-06-27 18:56:59

by Andy Lutomirski

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Fri, Jun 27, 2014 at 11:52 AM, Kees Cook <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 11:39 AM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 11:33 AM, Kees Cook <[email protected]> wrote:
>>> On Wed, Jun 25, 2014 at 11:07 AM, Andy Lutomirski <[email protected]> wrote:
>>>> On Wed, Jun 25, 2014 at 11:00 AM, Kees Cook <[email protected]> wrote:
>>>>> On Wed, Jun 25, 2014 at 10:51 AM, Oleg Nesterov <[email protected]> wrote:
>>>>>> On 06/25, Andy Lutomirski wrote:
>>>>>>>
>>>>>>> On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
>>>>>>> > On 06/25, Andy Lutomirski wrote:
>>>>>>> >>
>>>>>>> >> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
>>>>>>> >> then set the bit.
>>>>>>> >
>>>>>>> > Yes, exactly, this is what I meant. Plus rmb() in __secure_computing().
>>>>>>> >
>>>>>>> > But I still can't understand the rest of your discussion about the
>>>>>>> > ordering we need ;)
>>>>>>>
>>>>>>> Let me try again from scratch.
>>>>>>>
>>>>>>> Currently there are three relevant variables: TIF_SECCOMP,
>>>>>>> seccomp.mode, and seccomp.filter. __secure_computing needs
>>>>>>> seccomp.mode and seccomp.filter to be in sync, and it wants (but
>>>>>>> doesn't really need) TIF_SECCOMP to be in sync as well.
>>>>>>>
>>>>>>> My suggestion is to rearrange it a bit. Move mode into seccomp.filter
>>>>>>> (so that filter == NULL implies no seccomp) and don't check
>>>>>
>>>>> This would require that we reimplement mode 1 seccomp via mode 2
>>>>> filters. Which isn't too hard, but may add complexity.
>>>>>
>>>>>>> TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
>>>>>>> atomic except for the fact that the seccomp hooks won't be called if
>>>>>>> filter != NULL but !TIF_SECCOMP. This removes all ordering
>>>>>>> requirements.
>>>>>>
>>>>>> Ah, got it, thanks. Perhaps I missed something, but to me this looks like
>>>>>> unnecessary complication at first glance.
>>>>>>
>>>>>> We already have TIF_SECCOMP, we need it anyway, and we should only care
>>>>>> about the case when this bit is actually set, so that we can race with
>>>>>> the 1st call of __secure_computing().
>>>>>>
>>>>>> Otherwise we are fine: we can miss the new filter anyway, ->mode can't
>>>>>> be changed it is already nonzero.
>>>>>>
>>>>>>> Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
>>>>>>> In that case, filter needs to be set before TIF_SECCOMP is set, but
>>>>>>> that's straightforward.
>>>>>>
>>>>>> Yep. And this is how seccomp_assign_mode() already works? It is called
>>>>>> after we change ->filter chain, it changes ->mode before set(TIF_SECCOMP)
>>>>>> it just lacks a barrier.
>>>>>
>>>>> Right, I think the best solution is to add the barrier. I was
>>>>> concerned that adding the read barrier in secure_computing would have
>>>>> a performance impact, though.
>>>>>
>>>>
>>>> I can't speak for ARM, but I think that all of the read barriers are
>>>> essentially free on x86. (smp_mb is a very different story, but that
>>>> shouldn't be needed here.)
>>>
>>> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
>>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
>>>
>>> If I skip the rmb in the secure_computing call before checking mode,
>>> it sounds like I run the risk of racing an out-of-order TIF_SECCOMP vs
>>> mode and filter. This seems unlikely to me, given an addition of the
>>> smp_mb__before_atomic() during the seccomp_assign_mode()? I guess I
>>> don't have a sense of how aggressively ARM might do data caching in
>>> this area. Could the other thread actually see TIF_SECCOMP get set but
>>> still have an out of date copy of seccomp.mode?
>>>
>>> I really want to avoid adding anything to the secure_computing()
>>> execution path. :(
>>
>> Hence my suggestion to make the ordering not matter. No ordering
>> requirement, no barriers.
>
> I may be misunderstanding something, but I think there's still an
> ordering problem. We'll have TIF_SECCOMP already, so if we enter
> secure_computing with a NULL filter, we'll kill the process.
>
> Merging .mode and .filter would remove one of the race failure paths:
> having TIF_SECCOMP and not having a mode set (leading to BUG). With
> the merge, we could still race and land in the same place: having
> TIF_SECCOMP and mode==2, but filter==NULL, leading to WARN and kill.

You could just make secure_computing do nothing if filter == NULL.
It's probably faster to test that than TIF_SECCOMP anyway, since you
need to read the filter cacheline regardless, and testing a regular
variable for non-NULLness might be faster than an atomic bit test
operation. (Or may not -- I don't know.)
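The alternative being suggested here can be sketched as follows, with hypothetical stand-in types (the real kernel types and BPF execution are more involved): gate entirely on the filter pointer, so a half-published state is a no-op rather than a BUG or kill.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types under discussion. */
struct seccomp_filter { int verdict; };
struct seccomp { struct seccomp_filter *filter; };

/* Run the (stand-in) filter; the real kernel evaluates BPF here. */
static int run_filter(const struct seccomp_filter *f)
{
    return f->verdict;
}

/* Gate on the filter pointer itself instead of TIF_SECCOMP:
 * a NULL filter simply means "not (yet) confined", so no ordering
 * between flag and filter is required. */
int secure_computing_sketch(const struct seccomp *s)
{
    const struct seccomp_filter *f = s->filter;

    if (f == NULL)
        return 0;   /* nothing to enforce: do nothing */
    return run_filter(f);
}
```

The objection that follows in the thread is that this turns filter == NULL into a fail-open condition when TIF_SECCOMP is already set.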

>
> I guess the question is how large is the race risk on ARM? Is it
> possible to have TIF_SECCOMP that far out of sync for the thread?

Dunno. I don't like leaving crashy known races around.

--Andy



>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security



--
Andy Lutomirski
AMA Capital Management, LLC

2014-06-27 19:04:31

by Kees Cook

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Fri, Jun 27, 2014 at 11:56 AM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 11:52 AM, Kees Cook <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 11:39 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Jun 27, 2014 at 11:33 AM, Kees Cook <[email protected]> wrote:
>>>> On Wed, Jun 25, 2014 at 11:07 AM, Andy Lutomirski <[email protected]> wrote:
>>>>> On Wed, Jun 25, 2014 at 11:00 AM, Kees Cook <[email protected]> wrote:
>>>>>> On Wed, Jun 25, 2014 at 10:51 AM, Oleg Nesterov <[email protected]> wrote:
>>>>>>> On 06/25, Andy Lutomirski wrote:
>>>>>>>>
>>>>>>>> On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
>>>>>>>> > On 06/25, Andy Lutomirski wrote:
>>>>>>>> >>
>>>>>>>> >> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
>>>>>>>> >> then set the bit.
>>>>>>>> >
>>>>>>>> > Yes, exactly, this is what I meant. Plus rmb() in __secure_computing().
>>>>>>>> >
>>>>>>>> > But I still can't understand the rest of your discussion about the
>>>>>>>> > ordering we need ;)
>>>>>>>>
>>>>>>>> Let me try again from scratch.
>>>>>>>>
>>>>>>>> Currently there are three relevant variables: TIF_SECCOMP,
>>>>>>>> seccomp.mode, and seccomp.filter. __secure_computing needs
>>>>>>>> seccomp.mode and seccomp.filter to be in sync, and it wants (but
>>>>>>>> doesn't really need) TIF_SECCOMP to be in sync as well.
>>>>>>>>
>>>>>>>> My suggestion is to rearrange it a bit. Move mode into seccomp.filter
>>>>>>>> (so that filter == NULL implies no seccomp) and don't check
>>>>>>
>>>>>> This would require that we reimplement mode 1 seccomp via mode 2
>>>>>> filters. Which isn't too hard, but may add complexity.
>>>>>>
>>>>>>>> TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
>>>>>>>> atomic except for the fact that the seccomp hooks won't be called if
>>>>>>>> filter != NULL but !TIF_SECCOMP. This removes all ordering
>>>>>>>> requirements.
>>>>>>>
>>>>>>> Ah, got it, thanks. Perhaps I missed something, but to me this looks like
>>>>>>> unnecessary complication at first glance.
>>>>>>>
>>>>>>> We already have TIF_SECCOMP, we need it anyway, and we should only care
>>>>>>> about the case when this bit is actually set, so that we can race with
>>>>>>> the 1st call of __secure_computing().
>>>>>>>
>>>>>>> Otherwise we are fine: we can miss the new filter anyway, ->mode can't
>>>>>>> be changed it is already nonzero.
>>>>>>>
>>>>>>>> Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
>>>>>>>> In that case, filter needs to be set before TIF_SECCOMP is set, but
>>>>>>>> that's straightforward.
>>>>>>>
>>>>>>> Yep. And this is how seccomp_assign_mode() already works? It is called
>>>>>>> after we change ->filter chain, it changes ->mode before set(TIF_SECCOMP)
>>>>>>> it just lacks a barrier.
>>>>>>
>>>>>> Right, I think the best solution is to add the barrier. I was
>>>>>> concerned that adding the read barrier in secure_computing would have
>>>>>> a performance impact, though.
>>>>>>
>>>>>
>>>>> I can't speak for ARM, but I think that all of the read barriers are
>>>>> essentially free on x86. (smp_mb is a very different story, but that
>>>>> shouldn't be needed here.)
>>>>
>>>> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
>>>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
>>>>
>>>> If I skip the rmb in the secure_computing call before checking mode,
>>>> it sounds like I run the risk of racing an out-of-order TIF_SECCOMP vs
>>>> mode and filter. This seems unlikely to me, given an addition of the
>>>> smp_mb__before_atomic() during the seccomp_assign_mode()? I guess I
>>>> don't have a sense of how aggressively ARM might do data caching in
>>>> this area. Could the other thread actually see TIF_SECCOMP get set but
>>>> still have an out of date copy of seccomp.mode?
>>>>
>>>> I really want to avoid adding anything to the secure_computing()
>>>> execution path. :(
>>>
>>> Hence my suggestion to make the ordering not matter. No ordering
>>> requirement, no barriers.
>>
>> I may be misunderstanding something, but I think there's still an
>> ordering problem. We'll have TIF_SECCOMP already, so if we enter
>> secure_computing with a NULL filter, we'll kill the process.
>>
>> Merging .mode and .filter would remove one of the race failure paths:
>> having TIF_SECCOMP and not having a mode set (leading to BUG). With
>> the merge, we could still race and land in the same place: having
>> TIF_SECCOMP and mode==2, but filter==NULL, leading to WARN and kill.
>
> You could just make secure_computing do nothing if filter == NULL.
> It's probably faster to test that than TIF_SECCOMP anyway, since you
> need to read the filter cacheline regardless, and testing a regular
> variable for non-NULLness might be faster than an atomic bit test
> operation. (Or may not -- I don't know.)

I am uncomfortable about making filter == NULL be a "fail open"
condition if TIF_SECCOMP is set.

>> I guess the question is how large is the race risk on ARM? Is it
>> possible to have TIF_SECCOMP that far out of sync for the thread?
>
> Dunno. I don't like leaving crashy known races around.

Yeah, me too. Hrmpf. I will do some rmb() timing tests...

-Kees

--
Kees Cook
Chrome OS Security

2014-06-27 19:11:36

by Andy Lutomirski

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Fri, Jun 27, 2014 at 12:04 PM, Kees Cook <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 11:56 AM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 11:52 AM, Kees Cook <[email protected]> wrote:
>>> On Fri, Jun 27, 2014 at 11:39 AM, Andy Lutomirski <[email protected]> wrote:
>>>> On Fri, Jun 27, 2014 at 11:33 AM, Kees Cook <[email protected]> wrote:
>>>>> On Wed, Jun 25, 2014 at 11:07 AM, Andy Lutomirski <[email protected]> wrote:
>>>>>> On Wed, Jun 25, 2014 at 11:00 AM, Kees Cook <[email protected]> wrote:
>>>>>>> On Wed, Jun 25, 2014 at 10:51 AM, Oleg Nesterov <[email protected]> wrote:
>>>>>>>> On 06/25, Andy Lutomirski wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jun 25, 2014 at 10:32 AM, Oleg Nesterov <[email protected]> wrote:
>>>>>>>>> > On 06/25, Andy Lutomirski wrote:
>>>>>>>>> >>
>>>>>>>>> >> Write the filter, then smp_mb (or maybe a weaker barrier is okay),
>>>>>>>>> >> then set the bit.
>>>>>>>>> >
>>>>>>>>> > Yes, exactly, this is what I meant. Plus rmb() in __secure_computing().
>>>>>>>>> >
>>>>>>>>> > But I still can't understand the rest of your discussion about the
>>>>>>>>> > ordering we need ;)
>>>>>>>>>
>>>>>>>>> Let me try again from scratch.
>>>>>>>>>
>>>>>>>>> Currently there are three relevant variables: TIF_SECCOMP,
>>>>>>>>> seccomp.mode, and seccomp.filter. __secure_computing needs
>>>>>>>>> seccomp.mode and seccomp.filter to be in sync, and it wants (but
>>>>>>>>> doesn't really need) TIF_SECCOMP to be in sync as well.
>>>>>>>>>
>>>>>>>>> My suggestion is to rearrange it a bit. Move mode into seccomp.filter
>>>>>>>>> (so that filter == NULL implies no seccomp) and don't check
>>>>>>>
>>>>>>> This would require that we reimplement mode 1 seccomp via mode 2
>>>>>>> filters. Which isn't too hard, but may add complexity.
>>>>>>>
>>>>>>>>> TIF_SECCOMP in secure_computing. Then turning on seccomp is entirely
>>>>>>>>> atomic except for the fact that the seccomp hooks won't be called if
>>>>>>>>> filter != NULL but !TIF_SECCOMP. This removes all ordering
>>>>>>>>> requirements.
>>>>>>>>
>>>>>>>> Ah, got it, thanks. Perhaps I missed something, but to me this looks like
>>>>>>>> unnecessary complication at first glance.
>>>>>>>>
>>>>>>>> We already have TIF_SECCOMP, we need it anyway, and we should only care
>>>>>>>> about the case when this bit is actually set, so that we can race with
>>>>>>>> the 1st call of __secure_computing().
>>>>>>>>
>>>>>>>> Otherwise we are fine: we can miss the new filter anyway, ->mode can't
>>>>>>>> be changed it is already nonzero.
>>>>>>>>
>>>>>>>>> Alternatively, __secure_computing could still BUG_ON(!seccomp.filter).
>>>>>>>>> In that case, filter needs to be set before TIF_SECCOMP is set, but
>>>>>>>>> that's straightforward.
>>>>>>>>
>>>>>>>> Yep. And this is how seccomp_assign_mode() already works? It is called
>>>>>>>> after we change ->filter chain, it changes ->mode before set(TIF_SECCOMP)
>>>>>>>> it just lacks a barrier.
>>>>>>>
>>>>>>> Right, I think the best solution is to add the barrier. I was
>>>>>>> concerned that adding the read barrier in secure_computing would have
>>>>>>> a performance impact, though.
>>>>>>>
>>>>>>
>>>>>> I can't speak for ARM, but I think that all of the read barriers are
>>>>>> essentially free on x86. (smp_mb is a very different story, but that
>>>>>> shouldn't be needed here.)
>>>>>
>>>>> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
>>>>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
>>>>>
>>>>> If I skip the rmb in the secure_computing call before checking mode,
>>>>> it sounds like I run the risk of racing an out-of-order TIF_SECCOMP vs
>>>>> mode and filter. This seems unlikely to me, given an addition of the
>>>>> smp_mb__before_atomic() during the seccomp_assign_mode()? I guess I
>>>>> don't have a sense of how aggressively ARM might do data caching in
>>>>> this area. Could the other thread actually see TIF_SECCOMP get set but
>>>>> still have an out of date copy of seccomp.mode?
>>>>>
>>>>> I really want to avoid adding anything to the secure_computing()
>>>>> execution path. :(
>>>>
>>>> Hence my suggestion to make the ordering not matter. No ordering
>>>> requirement, no barriers.
>>>
>>> I may be misunderstanding something, but I think there's still an
>>> ordering problem. We'll have TIF_SECCOMP already, so if we enter
>>> secure_computing with a NULL filter, we'll kill the process.
>>>
>>> Merging .mode and .filter would remove one of the race failure paths:
>>> having TIF_SECCOMP and not having a mode set (leading to BUG). With
>>> the merge, we could still race and land in the same place: having
>>> TIF_SECCOMP and mode==2, but filter==NULL, leading to WARN and kill.
>>
>> You could just make secure_computing do nothing if filter == NULL.
>> It's probably faster to test that than TIF_SECCOMP anyway, since you
>> need to read the filter cacheline regardless, and testing a regular
>> variable for non-NULLness might be faster than an atomic bit test
>> operation. (Or may not -- I don't know.)
>
> I am uncomfortable about making filter == NULL be a "fail open"
> condition if TIF_SECCOMP is set.

I'm unconvinced here. TIF_SECCOMP unset is already a fail-open
condition if filter is set.

--Andy

2014-06-27 19:29:40

by Oleg Nesterov

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On 06/27, Kees Cook wrote:
>
> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
>
> ...
>
> I really want to avoid adding anything to the secure_computing()
> execution path. :(

I must have missed something but I do not understand your concerns.

__secure_computing() is not trivial, and we are going to execute the
filters. Do you really think rmb() can add a noticeable difference?

Not to mention that we can only get here if we take the slow syscall
enter path due to TIF_SECCOMP...

Oleg.

2014-06-27 19:31:41

by Andy Lutomirski

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Fri, Jun 27, 2014 at 12:27 PM, Oleg Nesterov <[email protected]> wrote:
> On 06/27, Kees Cook wrote:
>>
>> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
>>
>> ...
>>
>> I really want to avoid adding anything to the secure_computing()
>> execution path. :(
>
> I must have missed something but I do not understand your concerns.
>
> __secure_computing() is not trivial, and we are going to execute the
> filters. Do you really think rmb() can add a noticeable difference?
>
> Not to mention that we can only get here if we take the slow syscall
> enter path due to TIF_SECCOMP...
>

On my box, with my fancy multi-phase seccomp patches, the total
seccomp overhead for a very short filter is about 13ns. Adding a full
barrier would add several ns, I think.

Admittedly, this is x86, not ARM, so comparisons here are completely
bogus. And that read memory barrier doesn't even need an instruction
on x86. But still, let's try to keep this fast.

--Andy

2014-06-27 19:57:46

by Oleg Nesterov

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On 06/27, Andy Lutomirski wrote:
>
> On Fri, Jun 27, 2014 at 12:27 PM, Oleg Nesterov <[email protected]> wrote:
> > On 06/27, Kees Cook wrote:
> >>
> >> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
> >> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
> >>
> >> ...
> >>
> >> I really want to avoid adding anything to the secure_computing()
> >> execution path. :(
> >
> > I must have missed something but I do not understand your concerns.
> >
> > __secure_computing() is not trivial, and we are going to execute the
> > filters. Do you really think rmb() can add a noticeable difference?
> >
> > Not to mention that we can only get here if we take the slow syscall
> > enter path due to TIF_SECCOMP...
> >
>
> On my box, with my fancy multi-phase seccomp patches, the total
> seccomp overhead for a very short filter is about 13ns. Adding a full
> barrier would add several ns, I think.

I am just curious, does this 13ns overhead include the penalty from the
slow syscall enter path triggered by TIF_SECCOMP?

> Admittedly, this is x86, not ARM, so comparisons here are completely
> bogus. And that read memory barrier doesn't even need an instruction
> on x86. But still, let's try to keep this fast.

Well, I am not going to insist...

But imo it would be better to make it correct in the simplest way first; then
we can optimize this code and see if there is a noticeable difference.

Not only can we change the ordering, we can also remove the BUG_ON()s and just
accept the fact that __secure_computing() can race with sys_seccomp() from
another thread.

If nothing else, it would be much simpler to discuss this patch if it comes
as a separate change.

Oleg.

2014-06-27 20:08:44

by Andy Lutomirski

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Fri, Jun 27, 2014 at 12:55 PM, Oleg Nesterov <[email protected]> wrote:
> On 06/27, Andy Lutomirski wrote:
>>
>> On Fri, Jun 27, 2014 at 12:27 PM, Oleg Nesterov <[email protected]> wrote:
>> > On 06/27, Kees Cook wrote:
>> >>
>> >> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
>> >> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
>> >>
>> >> ...
>> >>
>> >> I really want to avoid adding anything to the secure_computing()
>> >> execution path. :(
>> >
>> > I must have missed something but I do not understand your concerns.
>> >
>> > __secure_computing() is not trivial, and we are going to execute the
>> > filters. Do you really think rmb() can add a noticeable difference?
>> >
>> > Not to mention that we can only get here if we take the slow syscall
>> > enter path due to TIF_SECCOMP...
>> >
>>
>> On my box, with my fancy multi-phase seccomp patches, the total
>> seccomp overhead for a very short filter is about 13ns. Adding a full
>> barrier would add several ns, I think.
>
> I am just curious, does this 13ns overhead include the penalty from the
> slow syscall enter path triggered by TIF_SECCOMP?

Yes, which is more or less the whole point of that patch series. I
rewrote part of the TIF_SECCOMP-but-no-tracing case in assembly :)

I'm playing with rewriting it in C, but it's looking like it'll be a
bit more far-reaching than I was hoping.

--Andy

2014-06-27 20:56:50

by Kees Cook

Subject: Re: [PATCH v8 5/9] seccomp: split mode set routines

On Fri, Jun 27, 2014 at 12:55 PM, Oleg Nesterov <[email protected]> wrote:
> On 06/27, Andy Lutomirski wrote:
>>
>> On Fri, Jun 27, 2014 at 12:27 PM, Oleg Nesterov <[email protected]> wrote:
>> > On 06/27, Kees Cook wrote:
>> >>
>> >> It looks like SMP ARM issues dsb for rmb, which seems a bit expensive.
>> >> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
>> >>
>> >> ...
>> >>
>> >> I really want to avoid adding anything to the secure_computing()
>> >> execution path. :(
>> >
>> > I must have missed something but I do not understand your concerns.
>> >
>> > __secure_computing() is not trivial, and we are going to execute the
>> > filters. Do you really think rmb() can add a noticeable difference?
>> >
>> > Not to mention that we can only get here if we take the slow syscall
>> > enter path due to TIF_SECCOMP...
>> >
>>
>> On my box, with my fancy multi-phase seccomp patches, the total
>> seccomp overhead for a very short filter is about 13ns. Adding a full
>> barrier would add several ns, I think.
>
> I am just curious, does this 13ns overhead include the penalty from the
> slow syscall enter path triggered by TIF_SECCOMP?
>
>> Admittedly, this is x86, not ARM, so comparisons here are completely
>> bogus. And that read memory barrier doesn't even need an instruction
>> on x86. But still, let's try to keep this fast.
>
> Well, I am not going to insist...
>
> But imo it would be better to make it correct in the simplest way first; then
> we can optimize this code and see if there is a noticeable difference.
>
> Not only can we change the ordering, we can also remove the BUG_ON()s and just
> accept the fact that __secure_computing() can race with sys_seccomp() from
> another thread.
>
> If nothing else, it would be much simpler to discuss this patch if it comes
> as a separate change.

Yeah, the way I want to go is to add the rmb() for now (it gets us
"correctness"), and then deal with any potential performance
problems on ARM at a later time. (At present, I'm unable to perform
any timings on ARM -- maybe next week.)

I will send the v9 series in a moment here...

-Kees

--
Kees Cook
Chrome OS Security