2014-06-23 22:02:07

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 0/9] seccomp: add thread sync ability

This adds the ability for threads to request seccomp filter
synchronization across their thread group (at filter attach time).
For example, for Chrome to make sure graphic driver threads are fully
confined after seccomp filters have been attached.

To support this, locking on seccomp changes is introduced, along with
refactoring of no_new_privs. Races with thread creation/death are handled
via tasklist_lock.

This includes a new syscall (instead of adding a new prctl option),
as suggested by Andy Lutomirski and Michael Kerrisk.

Thanks!

-Kees

v7:
- rebase on Linus's tree (merged with network bpf changes)
- wrote manpage text documenting API (follows this series)
v6:
- switch from seccomp-specific lock to thread-group lock to gain atomicity
- implement seccomp syscall across all architectures with seccomp filter
- clean up sparse warnings around locking
v5:
- move includes around (drysdale)
- drop set_nnp return value (luto)
- use smp_load_acquire/store_release (luto)
- merge nnp changes to seccomp always, fewer ifdef (luto)
v4:
- cleaned up locking further, as noticed by David Drysdale
v3:
- added SECCOMP_EXT_ACT_FILTER for new filter install options
v2:
- reworked to avoid clone races


2014-06-23 21:59:05

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 8/9] ARM: add seccomp syscall

Wires up the new seccomp syscall.

Signed-off-by: Kees Cook <[email protected]>
---
arch/arm/include/uapi/asm/unistd.h | 1 +
arch/arm/kernel/calls.S | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index ba94446c72d9..e21b4a069701 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -409,6 +409,7 @@
#define __NR_sched_setattr (__NR_SYSCALL_BASE+380)
#define __NR_sched_getattr (__NR_SYSCALL_BASE+381)
#define __NR_renameat2 (__NR_SYSCALL_BASE+382)
+#define __NR_seccomp (__NR_SYSCALL_BASE+383)

/*
* This may need to be greater than __NR_last_syscall+1 in order to
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 8f51bdcdacbb..bea85f97f363 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -392,6 +392,7 @@
/* 380 */ CALL(sys_sched_setattr)
CALL(sys_sched_getattr)
CALL(sys_renameat2)
+ CALL(sys_seccomp)
#ifndef syscalls_counted
.equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
#define syscalls_counted
--
1.7.9.5

2014-06-23 21:59:04

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 1/9] seccomp: create internal mode-setting function

In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.

Signed-off-by: Kees Cook <[email protected]>
---
kernel/seccomp.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 301bbc24739c..afb916c7e890 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -473,7 +473,7 @@ long prctl_get_seccomp(void)
}

/**
- * prctl_set_seccomp: configures current->seccomp.mode
+ * seccomp_set_mode: internal function for setting seccomp mode
* @seccomp_mode: requested mode to use
* @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
*
@@ -486,7 +486,7 @@ long prctl_get_seccomp(void)
*
* Returns 0 on success or -EINVAL on failure.
*/
-long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
+static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
{
long ret = -EINVAL;

@@ -517,3 +517,15 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
out:
return ret;
}
+
+/**
+ * prctl_set_seccomp: configures current->seccomp.mode
+ * @seccomp_mode: requested mode to use
+ * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
+{
+ return seccomp_set_mode(seccomp_mode, filter);
+}
--
1.7.9.5

2014-06-23 21:59:54

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 4/9] seccomp: move no_new_privs into seccomp

Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, changes must be atomic. This moves the nnp
flag into the seccomp field as a separate unsigned long for atomic access.

Signed-off-by: Kees Cook <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
---
fs/exec.c | 4 ++--
include/linux/sched.h | 13 ++++++++++---
include/linux/seccomp.h | 8 +++++++-
kernel/seccomp.c | 2 +-
kernel/sys.c | 4 ++--
security/apparmor/domain.c | 4 ++--
6 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a3d33fe592d6..0f5c272410f6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1234,7 +1234,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
* This isn't strictly necessary, but it makes it harder for LSMs to
* mess up.
*/
- if (current->no_new_privs)
+ if (task_no_new_privs(current))
bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;

t = p;
@@ -1272,7 +1272,7 @@ int prepare_binprm(struct linux_binprm *bprm)
bprm->cred->egid = current_egid();

if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) &&
- !current->no_new_privs &&
+ !task_no_new_privs(current) &&
kuid_has_mapping(bprm->cred->user_ns, inode->i_uid) &&
kgid_has_mapping(bprm->cred->user_ns, inode->i_gid)) {
/* Set-uid? */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0c987a..f22c4735cead 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1307,9 +1307,6 @@ struct task_struct {
* execve */
unsigned in_iowait:1;

- /* task may not gain privileges */
- unsigned no_new_privs:1;
-
/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
@@ -2529,6 +2526,16 @@ static inline void task_unlock(struct task_struct *p)
spin_unlock(&p->alloc_lock);
}

+static inline bool task_no_new_privs(struct task_struct *p)
+{
+ return test_bit(SECCOMP_FLAG_NO_NEW_PRIVS, &p->seccomp.flags);
+}
+
+static inline void task_set_no_new_privs(struct task_struct *p)
+{
+ set_bit(SECCOMP_FLAG_NO_NEW_PRIVS, &p->seccomp.flags);
+}
+
extern struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
unsigned long *flags);

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9ff98b4bfe2e..6a5e2d0ec912 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,6 +3,8 @@

#include <uapi/linux/seccomp.h>

+#define SECCOMP_FLAG_NO_NEW_PRIVS 0 /* task may not gain privs */
+
#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
@@ -16,6 +18,7 @@ struct seccomp_filter;
* system calls available to a process.
* @filter: must always point to a valid seccomp-filter or NULL as it is
* accessed without locking during system call entry.
+ * @flags: flags under task->sighand->siglock lock
*
* @filter must only be accessed from the context of current as there
* is no read locking.
@@ -23,6 +26,7 @@ struct seccomp_filter;
struct seccomp {
int mode;
struct seccomp_filter *filter;
+ unsigned long flags;
};

extern int __secure_computing(int);
@@ -51,7 +55,9 @@ static inline int seccomp_mode(struct seccomp *s)

#include <linux/errno.h>

-struct seccomp { };
+struct seccomp {
+ unsigned long flags;
+};
struct seccomp_filter { };

static inline int secure_computing(int this_syscall) { return 0; }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 065ff5137e39..8ab0b7116ed8 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -217,7 +217,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
* This avoids scenarios where unprivileged tasks can affect the
* behavior of privileged children.
*/
- if (!current->no_new_privs &&
+ if (!task_no_new_privs(current) &&
security_capable_noaudit(current_cred(), current_user_ns(),
CAP_SYS_ADMIN) != 0)
return ERR_PTR(-EACCES);
diff --git a/kernel/sys.c b/kernel/sys.c
index 66a751ebf9d9..ce8129192a26 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1990,12 +1990,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (arg2 != 1 || arg3 || arg4 || arg5)
return -EINVAL;

- current->no_new_privs = 1;
+ task_set_no_new_privs(current);
break;
case PR_GET_NO_NEW_PRIVS:
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
- return current->no_new_privs ? 1 : 0;
+ return task_no_new_privs(current) ? 1 : 0;
case PR_GET_THP_DISABLE:
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 452567d3a08e..d97cba3e3849 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -621,7 +621,7 @@ int aa_change_hat(const char *hats[], int count, u64 token, bool permtest)
* There is no exception for unconfined as change_hat is not
* available.
*/
- if (current->no_new_privs)
+ if (task_no_new_privs(current))
return -EPERM;

/* released below */
@@ -776,7 +776,7 @@ int aa_change_profile(const char *ns_name, const char *hname, bool onexec,
* no_new_privs is set because this aways results in a reduction
* of permissions.
*/
- if (current->no_new_privs && !unconfined(profile)) {
+ if (task_no_new_privs(current) && !unconfined(profile)) {
put_cred(cred);
return -EPERM;
}
--
1.7.9.5

2014-06-23 21:59:52

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 6/9] seccomp: add "seccomp" syscall

This adds the new "seccomp" syscall with both an "operation" and "flags"
parameter for future expansion. The third argument is a pointer value,
used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

In addition to the TSYNC flag in the following patch, there is a non-zero
chance that this syscall could be used for configuring a fixed argument
area for seccomp-tracer-aware processes to pass syscall arguments in
the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
for this syscall. Additionally, this syscall uses operation, flags,
and user pointer for arguments because strictly passing arguments via
a user pointer would mean seccomp itself would be unable to trivially
filter the seccomp syscall itself.

Signed-off-by: Kees Cook <[email protected]>
Cc: [email protected]
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 ++
include/uapi/asm-generic/unistd.h | 4 ++-
include/uapi/linux/seccomp.h | 4 +++
kernel/seccomp.c | 63 ++++++++++++++++++++++++++++++++-----
kernel/sys_ni.c | 3 ++
7 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b867921612..7527eac24122 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
351 i386 sched_setattr sys_sched_setattr
352 i386 sched_getattr sys_sched_getattr
353 i386 renameat2 sys_renameat2
+354 i386 seccomp sys_seccomp
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..16272a6c12b7 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common seccomp sys_seccomp

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0ed322..1713977ee26f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
+ const char __user *uargs);
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 333640608087..65acbf0e2867 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -699,9 +699,11 @@ __SYSCALL(__NR_sched_setattr, sys_sched_setattr)
__SYSCALL(__NR_sched_getattr, sys_sched_getattr)
#define __NR_renameat2 276
__SYSCALL(__NR_renameat2, sys_renameat2)
+#define __NR_seccomp 277
+__SYSCALL(__NR_seccomp, sys_seccomp)

#undef __NR_syscalls
-#define __NR_syscalls 277
+#define __NR_syscalls 278

/*
* All syscalls below here should go away really,
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index ac2dc9f72973..b258878ba754 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -10,6 +10,10 @@
#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */

+/* Valid operations for seccomp syscall. */
+#define SECCOMP_SET_MODE_STRICT 0
+#define SECCOMP_SET_MODE_FILTER 1
+
/*
* All BPF programs must return a 32-bit value.
* The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 1fb162e8b032..a0db770ff26c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -19,6 +19,7 @@
#include <linux/sched.h>
#include <linux/seccomp.h>
#include <linux/slab.h>
+#include <linux/syscalls.h>

/* #define SECCOMP_DEBUG 1 */

@@ -307,8 +308,8 @@ free_prog:
*
* Returns filter on success and ERR_PTR otherwise.
*/
-static
-struct seccomp_filter *seccomp_prepare_user_filter(char __user *user_filter)
+static struct seccomp_filter *
+seccomp_prepare_user_filter(const char __user *user_filter)
{
struct sock_fprog fprog;
struct seccomp_filter *filter = ERR_PTR(-EFAULT);
@@ -331,19 +332,25 @@ out:

/**
* seccomp_attach_filter: validate and attach filter
+ * @flags: flags to change filter behavior
* @filter: seccomp filter to add to the current process
*
* Caller must be holding current->sighand->siglock lock.
*
* Returns 0 on success, -ve on error.
*/
-static long seccomp_attach_filter(struct seccomp_filter *filter)
+static long seccomp_attach_filter(unsigned int flags,
+ struct seccomp_filter *filter)
{
unsigned long total_insns;
struct seccomp_filter *walker;

BUG_ON(!spin_is_locked(&current->sighand->siglock));

+ /* Validate flags. */
+ if (flags != 0)
+ return -EINVAL;
+
/* Validate resulting filter length. */
total_insns = filter->prog->len;
for (walker = current->seccomp.filter; walker; walker = walker->prev)
@@ -555,6 +562,7 @@ out:
#ifdef CONFIG_SECCOMP_FILTER
/**
* seccomp_set_mode_filter: internal function for setting seccomp filter
+ * @flags: flags to change filter behavior
* @filter: struct sock_fprog containing filter
*
* This function may be called repeatedly to install additional filters.
@@ -565,7 +573,8 @@ out:
*
* Returns 0 on success or -EINVAL on failure.
*/
-static long seccomp_set_mode_filter(char __user *filter)
+static long seccomp_set_mode_filter(unsigned int flags,
+ const char __user *filter)
{
const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
struct seccomp_filter *prepared = NULL;
@@ -583,7 +592,7 @@ static long seccomp_set_mode_filter(char __user *filter)
if (!seccomp_check_mode(current, seccomp_mode))
goto out;

- ret = seccomp_attach_filter(prepared);
+ ret = seccomp_attach_filter(flags, prepared);
if (ret)
goto out;
/* Do not free the successfully attached filter. */
@@ -597,12 +606,35 @@ out_free:
return ret;
}
#else
-static inline long seccomp_set_mode_filter(char __user *filter)
+static inline long seccomp_set_mode_filter(unsigned int flags,
+ const char __user *filter)
{
return -EINVAL;
}
#endif

+/* Common entry point for both prctl and syscall. */
+static long do_seccomp(unsigned int op, unsigned int flags,
+ const char __user *uargs)
+{
+ switch (op) {
+ case SECCOMP_SET_MODE_STRICT:
+ if (flags != 0 || uargs != NULL)
+ return -EINVAL;
+ return seccomp_set_mode_strict();
+ case SECCOMP_SET_MODE_FILTER:
+ return seccomp_set_mode_filter(flags, uargs);
+ default:
+ return -EINVAL;
+ }
+}
+
+SYSCALL_DEFINE3(seccomp, unsigned int, op, unsigned int, flags,
+ const char __user *, uargs)
+{
+ return do_seccomp(op, flags, uargs);
+}
+
/**
* prctl_set_seccomp: configures current->seccomp.mode
* @seccomp_mode: requested mode to use
@@ -612,12 +644,27 @@ static inline long seccomp_set_mode_filter(char __user *filter)
*/
long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
{
+ unsigned int op;
+ char __user *uargs;
+
switch (seccomp_mode) {
case SECCOMP_MODE_STRICT:
- return seccomp_set_mode_strict();
+ op = SECCOMP_SET_MODE_STRICT;
+ /*
+ * Setting strict mode through prctl always ignored filter,
+ * so make sure it is always NULL here to pass the internal
+ * check in do_seccomp().
+ */
+ uargs = NULL;
+ break;
case SECCOMP_MODE_FILTER:
- return seccomp_set_mode_filter(filter);
+ op = SECCOMP_SET_MODE_FILTER;
+ uargs = filter;
+ break;
default:
return -EINVAL;
}
+
+ /* prctl interface doesn't have flags, so they are always zero. */
+ return do_seccomp(op, 0, uargs);
}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b51b5df..2904a2105914 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -213,3 +213,6 @@ cond_syscall(compat_sys_open_by_handle_at);

/* compare kernel pointers */
cond_syscall(sys_kcmp);
+
+/* operate on Secure Computing state */
+cond_syscall(sys_seccomp);
--
1.7.9.5

2014-06-23 21:59:50

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 5/9] seccomp: split mode set routines

Extracts the common check/assign logic, and separates the two mode
setting paths to make things more readable with fewer #ifdefs within
function bodies.

Signed-off-by: Kees Cook <[email protected]>
---
kernel/seccomp.c | 124 +++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 85 insertions(+), 39 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 8ab0b7116ed8..1fb162e8b032 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -193,7 +193,29 @@ static u32 seccomp_run_filters(int syscall)
}
return ret;
}
+#endif /* CONFIG_SECCOMP_FILTER */

+static inline bool seccomp_check_mode(struct task_struct *task,
+ unsigned long seccomp_mode)
+{
+ BUG_ON(!spin_is_locked(&task->sighand->siglock));
+
+ if (task->seccomp.mode && task->seccomp.mode != seccomp_mode)
+ return false;
+
+ return true;
+}
+
+static inline void seccomp_assign_mode(struct task_struct *task,
+ unsigned long seccomp_mode)
+{
+ BUG_ON(!spin_is_locked(&task->sighand->siglock));
+
+ task->seccomp.mode = seccomp_mode;
+ set_tsk_thread_flag(task, TIF_SECCOMP);
+}
+
+#ifdef CONFIG_SECCOMP_FILTER
/**
* seccomp_prepare_filter: Prepares a seccomp filter for use.
* @fprog: BPF program to install
@@ -500,69 +522,86 @@ long prctl_get_seccomp(void)
}

/**
- * seccomp_set_mode: internal function for setting seccomp mode
- * @seccomp_mode: requested mode to use
- * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
+ * seccomp_set_mode_strict: internal function for setting strict seccomp
*
- * This function may be called repeatedly with a @seccomp_mode of
- * SECCOMP_MODE_FILTER to install additional filters. Every filter
- * successfully installed will be evaluated (in reverse order) for each system
- * call the task makes.
+ * Once current->seccomp.mode is non-zero, it may not be changed.
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+static long seccomp_set_mode_strict(void)
+{
+ const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
+ unsigned long irqflags;
+ int ret = -EINVAL;
+
+ if (unlikely(!lock_task_sighand(current, &irqflags)))
+ return -EINVAL;
+
+ if (!seccomp_check_mode(current, seccomp_mode))
+ goto out;
+
+#ifdef TIF_NOTSC
+ disable_TSC();
+#endif
+ seccomp_assign_mode(current, seccomp_mode);
+ ret = 0;
+
+out:
+ unlock_task_sighand(current, &irqflags);
+
+ return ret;
+}
+
+#ifdef CONFIG_SECCOMP_FILTER
+/**
+ * seccomp_set_mode_filter: internal function for setting seccomp filter
+ * @filter: struct sock_fprog containing filter
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the task makes.
*
* Once current->seccomp.mode is non-zero, it may not be changed.
*
* Returns 0 on success or -EINVAL on failure.
*/
-static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
+static long seccomp_set_mode_filter(char __user *filter)
{
+ const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
struct seccomp_filter *prepared = NULL;
unsigned long irqflags;
long ret = -EINVAL;

-#ifdef CONFIG_SECCOMP_FILTER
- /* Prepare the new filter outside of the seccomp lock. */
- if (seccomp_mode == SECCOMP_MODE_FILTER) {
- prepared = seccomp_prepare_user_filter(filter);
- if (IS_ERR(prepared))
- return PTR_ERR(prepared);
- }
-#endif
+ /* Prepare the new filter outside of any locking. */
+ prepared = seccomp_prepare_user_filter(filter);
+ if (IS_ERR(prepared))
+ return PTR_ERR(prepared);

if (unlikely(!lock_task_sighand(current, &irqflags)))
goto out_free;

- if (current->seccomp.mode &&
- current->seccomp.mode != seccomp_mode)
+ if (!seccomp_check_mode(current, seccomp_mode))
goto out;

- switch (seccomp_mode) {
- case SECCOMP_MODE_STRICT:
- ret = 0;
-#ifdef TIF_NOTSC
- disable_TSC();
-#endif
- break;
-#ifdef CONFIG_SECCOMP_FILTER
- case SECCOMP_MODE_FILTER:
- ret = seccomp_attach_filter(prepared);
- if (ret)
- goto out;
- /* Do not free the successfully attached filter. */
- prepared = NULL;
- break;
-#endif
- default:
+ ret = seccomp_attach_filter(prepared);
+ if (ret)
goto out;
- }
+ /* Do not free the successfully attached filter. */
+ prepared = NULL;

- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ seccomp_assign_mode(current, seccomp_mode);
out:
unlock_task_sighand(current, &irqflags);
out_free:
seccomp_filter_free(prepared);
return ret;
}
+#else
+static inline long seccomp_set_mode_filter(char __user *filter)
+{
+ return -EINVAL;
+}
+#endif

/**
* prctl_set_seccomp: configures current->seccomp.mode
@@ -573,5 +612,12 @@ out_free:
*/
long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
{
- return seccomp_set_mode(seccomp_mode, filter);
+ switch (seccomp_mode) {
+ case SECCOMP_MODE_STRICT:
+ return seccomp_set_mode_strict();
+ case SECCOMP_MODE_FILTER:
+ return seccomp_set_mode_filter(filter);
+ default:
+ return -EINVAL;
+ }
}
--
1.7.9.5

2014-06-23 22:00:53

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 2/9] seccomp: split filter prep from check and apply

In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.

Signed-off-by: Kees Cook <[email protected]>
---
kernel/seccomp.c | 97 +++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 67 insertions(+), 30 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index afb916c7e890..edc8c79ed16d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -18,6 +18,7 @@
#include <linux/compat.h>
#include <linux/sched.h>
#include <linux/seccomp.h>
+#include <linux/slab.h>

/* #define SECCOMP_DEBUG 1 */

@@ -26,7 +27,6 @@
#include <linux/filter.h>
#include <linux/ptrace.h>
#include <linux/security.h>
-#include <linux/slab.h>
#include <linux/tracehook.h>
#include <linux/uaccess.h>

@@ -196,27 +196,21 @@ static u32 seccomp_run_filters(int syscall)
}

/**
- * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * seccomp_prepare_filter: Prepares a seccomp filter for use.
* @fprog: BPF program to install
*
- * Returns 0 on success or an errno on failure.
+ * Returns filter on success or an ERR_PTR on failure.
*/
-static long seccomp_attach_filter(struct sock_fprog *fprog)
+static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
{
struct seccomp_filter *filter;
unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
- unsigned long total_insns = fprog->len;
struct sock_filter *fp;
int new_len;
long ret;

if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
- return -EINVAL;
-
- for (filter = current->seccomp.filter; filter; filter = filter->prev)
- total_insns += filter->prog->len + 4; /* include a 4 instr penalty */
- if (total_insns > MAX_INSNS_PER_PATH)
- return -ENOMEM;
+ return ERR_PTR(-EINVAL);

/*
* Installing a seccomp filter requires that the task has
@@ -227,11 +221,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
if (!current->no_new_privs &&
security_capable_noaudit(current_cred(), current_user_ns(),
CAP_SYS_ADMIN) != 0)
- return -EACCES;
+ return ERR_PTR(-EACCES);

fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
if (!fp)
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);

/* Copy the instructions from fprog. */
ret = -EFAULT;
@@ -275,13 +269,7 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)

sk_filter_select_runtime(filter->prog);

- /*
- * If there is an existing filter, make it the prev and don't drop its
- * task reference.
- */
- filter->prev = current->seccomp.filter;
- current->seccomp.filter = filter;
- return 0;
+ return filter;

free_filter_prog:
kfree(filter->prog);
@@ -289,19 +277,20 @@ free_filter:
kfree(filter);
free_prog:
kfree(fp);
- return ret;
+ return ERR_PTR(ret);
}

/**
- * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
+ * seccomp_prepare_user_filter - prepares a user-supplied sock_fprog
* @user_filter: pointer to the user data containing a sock_fprog.
*
- * Returns 0 on success and non-zero otherwise.
+ * Returns filter on success and ERR_PTR otherwise.
*/
-static long seccomp_attach_user_filter(char __user *user_filter)
+static
+struct seccomp_filter *seccomp_prepare_user_filter(char __user *user_filter)
{
struct sock_fprog fprog;
- long ret = -EFAULT;
+ struct seccomp_filter *filter = ERR_PTR(-EFAULT);

#ifdef CONFIG_COMPAT
if (is_compat_task()) {
@@ -314,9 +303,37 @@ static long seccomp_attach_user_filter(char __user *user_filter)
#endif
if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
goto out;
- ret = seccomp_attach_filter(&fprog);
+ filter = seccomp_prepare_filter(&fprog);
out:
- return ret;
+ return filter;
+}
+
+/**
+ * seccomp_attach_filter: validate and attach filter
+ * @filter: seccomp filter to add to the current process
+ *
+ * Returns 0 on success, -ve on error.
+ */
+static long seccomp_attach_filter(struct seccomp_filter *filter)
+{
+ unsigned long total_insns;
+ struct seccomp_filter *walker;
+
+ /* Validate resulting filter length. */
+ total_insns = filter->prog->len;
+ for (walker = current->seccomp.filter; walker; walker = walker->prev)
+ total_insns += walker->prog->len + 4; /* 4 instr penalty */
+ if (total_insns > MAX_INSNS_PER_PATH)
+ return -ENOMEM;
+
+ /*
+ * If there is an existing filter, make it the prev and don't drop its
+ * task reference.
+ */
+ filter->prev = current->seccomp.filter;
+ current->seccomp.filter = filter;
+
+ return 0;
}

/* get_seccomp_filter - increments the reference count of the filter on @tsk */
@@ -329,6 +346,14 @@ void get_seccomp_filter(struct task_struct *tsk)
atomic_inc(&orig->usage);
}

+static inline void seccomp_filter_free(struct seccomp_filter *filter)
+{
+ if (filter) {
+ sk_filter_free(filter->prog);
+ kfree(filter);
+ }
+}
+
/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
void put_seccomp_filter(struct task_struct *tsk)
{
@@ -337,8 +362,7 @@ void put_seccomp_filter(struct task_struct *tsk)
while (orig && atomic_dec_and_test(&orig->usage)) {
struct seccomp_filter *freeme = orig;
orig = orig->prev;
- sk_filter_free(freeme->prog);
- kfree(freeme);
+ seccomp_filter_free(freeme);
}
}

@@ -488,8 +512,18 @@ long prctl_get_seccomp(void)
*/
static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
{
+ struct seccomp_filter *prepared = NULL;
long ret = -EINVAL;

+#ifdef CONFIG_SECCOMP_FILTER
+ /* Prepare the new filter outside of the seccomp lock. */
+ if (seccomp_mode == SECCOMP_MODE_FILTER) {
+ prepared = seccomp_prepare_user_filter(filter);
+ if (IS_ERR(prepared))
+ return PTR_ERR(prepared);
+ }
+#endif
+
if (current->seccomp.mode &&
current->seccomp.mode != seccomp_mode)
goto out;
@@ -503,9 +537,11 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
break;
#ifdef CONFIG_SECCOMP_FILTER
case SECCOMP_MODE_FILTER:
- ret = seccomp_attach_user_filter(filter);
+ ret = seccomp_attach_filter(prepared);
if (ret)
goto out;
+ /* Do not free the successfully attached filter. */
+ prepared = NULL;
break;
#endif
default:
@@ -515,6 +551,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
current->seccomp.mode = seccomp_mode;
set_thread_flag(TIF_SECCOMP);
out:
+ seccomp_filter_free(prepared);
return ret;
}

--
1.7.9.5

2014-06-23 22:00:52

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 7/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.

The race conditions are against another thread requesting TSYNC, another
thread performing a clone, and another thread changing its filter. The
sighand lock is sufficient for these cases, though the clone case is
assisted by the tasklist_lock so that new threads must have a duplicate
of its parent seccomp state when it appears on the tasklist.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
---
arch/Kconfig | 1 +
include/linux/seccomp.h | 2 +
include/uapi/linux/seccomp.h | 3 +
kernel/seccomp.c | 135 +++++++++++++++++++++++++++++++++++++++++-
4 files changed, 139 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 97ff872c7acc..0eae9df35b88 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -321,6 +321,7 @@ config HAVE_ARCH_SECCOMP_FILTER
- secure_computing is called from a ptrace_event()-safe context
- secure_computing return value is checked and a return value of -1
results in the system call being skipped immediately.
+ - seccomp syscall wired up

config SECCOMP_FILTER
def_bool y
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 6a5e2d0ec912..e1ae50f7cb0b 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,6 +5,8 @@

#define SECCOMP_FLAG_NO_NEW_PRIVS 0 /* task may not gain privs */

+#define SECCOMP_FILTER_FLAG_MASK ~(SECCOMP_FILTER_FLAG_TSYNC)
+
#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index b258878ba754..0f238a43ff1e 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -14,6 +14,9 @@
#define SECCOMP_SET_MODE_STRICT 0
#define SECCOMP_SET_MODE_FILTER 1

+/* Valid flags for SECCOMP_SET_MODE_FILTER */
+#define SECCOMP_FILTER_FLAG_TSYNC 1
+
/*
* All BPF programs must return a 32-bit value.
* The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index a0db770ff26c..ed3a4453c054 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -26,6 +26,7 @@
#ifdef CONFIG_SECCOMP_FILTER
#include <asm/syscall.h>
#include <linux/filter.h>
+#include <linux/pid.h>
#include <linux/ptrace.h>
#include <linux/security.h>
#include <linux/tracehook.h>
@@ -217,6 +218,107 @@ static inline void seccomp_assign_mode(struct task_struct *task,
}

#ifdef CONFIG_SECCOMP_FILTER
+/* Returns 1 if the candidate is an ancestor. */
+static int is_ancestor(struct seccomp_filter *candidate,
+ struct seccomp_filter *child)
+{
+ /* NULL is the root ancestor. */
+ if (candidate == NULL)
+ return 1;
+ for (; child; child = child->prev)
+ if (child == candidate)
+ return 1;
+ return 0;
+}
+
+/**
+ * seccomp_can_sync_threads: checks if all threads can be synchronized
+ *
+ * Expects both tasklist_lock and current->sighand->siglock to be held.
+ *
+ * Returns 0 on success, -ve on error, or the pid of a thread which was
+ * either not in the correct seccomp mode or it did not have an ancestral
+ * seccomp filter.
+ */
+static pid_t seccomp_can_sync_threads(void)
+{
+ struct task_struct *thread, *caller;
+
+ BUG_ON(write_can_lock(&tasklist_lock));
+ BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+ if (current->seccomp.mode != SECCOMP_MODE_FILTER)
+ return -EACCES;
+
+ /* Validate all threads being eligible for synchronization. */
+ thread = caller = current;
+ for_each_thread(caller, thread) {
+ pid_t failed;
+
+ if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
+ (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
+ is_ancestor(thread->seccomp.filter,
+ caller->seccomp.filter)))
+ continue;
+
+ /* Return the first thread that cannot be synchronized. */
+ failed = task_pid_vnr(thread);
+ /* If the pid cannot be resolved, then return -ESRCH */
+ if (failed == 0)
+ failed = -ESRCH;
+ return failed;
+ }
+
+ return 0;
+}
+
+/**
+ * seccomp_sync_threads: sets all threads to use current's filter
+ *
+ * Expects both tasklist_lock and current->sighand->siglock to be held,
+ * and for seccomp_can_sync_threads() to have returned success already
+ * without dropping the locks.
+ *
+ */
+static void seccomp_sync_threads(void)
+{
+ struct task_struct *thread, *caller;
+
+ BUG_ON(write_can_lock(&tasklist_lock));
+ BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+ /* Synchronize all threads. */
+ thread = caller = current;
+ for_each_thread(caller, thread) {
+ /* Get a task reference for the new leaf node. */
+ get_seccomp_filter(caller);
+ /*
+ * Drop the task reference to the shared ancestor since
+ * current's path will hold a reference. (This also
+ * allows a put before the assignment.)
+ */
+ put_seccomp_filter(thread);
+ thread->seccomp.filter = caller->seccomp.filter;
+ /* Opt the other thread into seccomp if needed.
+ * As threads are considered to be trust-realm
+ * equivalent (see ptrace_may_access), it is safe to
+ * allow one thread to transition the other.
+ */
+ if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
+ /*
+ * Don't let an unprivileged task work around
+ * the no_new_privs restriction by creating
+ * a thread that sets it up, enters seccomp,
+ * then dies.
+ */
+ if (task_no_new_privs(caller))
+ task_set_no_new_privs(thread);
+
+ seccomp_assign_mode(thread, SECCOMP_MODE_FILTER);
+ }
+ }
+}
+
/**
* seccomp_prepare_filter: Prepares a seccomp filter for use.
* @fprog: BPF program to install
@@ -348,7 +450,7 @@ static long seccomp_attach_filter(unsigned int flags,
BUG_ON(!spin_is_locked(&current->sighand->siglock));

/* Validate flags. */
- if (flags != 0)
+ if (flags & SECCOMP_FILTER_FLAG_MASK)
return -EINVAL;

/* Validate resulting filter length. */
@@ -358,6 +460,15 @@ static long seccomp_attach_filter(unsigned int flags,
if (total_insns > MAX_INSNS_PER_PATH)
return -ENOMEM;

+ /* If thread sync has been requested, check that it is possible. */
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+ int ret;
+
+ ret = seccomp_can_sync_threads();
+ if (ret)
+ return ret;
+ }
+
/*
* If there is an existing filter, make it the prev and don't drop its
* task reference.
@@ -365,6 +476,10 @@ static long seccomp_attach_filter(unsigned int flags,
filter->prev = current->seccomp.filter;
smp_store_release(&current->seccomp.filter, filter);

+ /* Now that the new filter is in place, synchronize to all threads. */
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+ seccomp_sync_threads();
+
return 0;
}

@@ -578,7 +693,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
{
const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
struct seccomp_filter *prepared = NULL;
- unsigned long irqflags;
+ unsigned long irqflags, taskflags;
long ret = -EINVAL;

/* Prepare the new filter outside of any locking. */
@@ -586,6 +701,17 @@ static long seccomp_set_mode_filter(unsigned int flags,
if (IS_ERR(prepared))
return PTR_ERR(prepared);

+ /*
+ * If we're doing thread sync, we must hold tasklist_lock
+ * to make sure seccomp filter changes are stable on threads
+ * entering or leaving the task list. And we must take it
+ * before the sighand lock to avoid deadlocking.
+ */
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+ write_lock_irqsave(&tasklist_lock, taskflags);
+ else
+ __acquire(&tasklist_lock); /* keep sparse happy */
+
if (unlikely(!lock_task_sighand(current, &irqflags)))
goto out_free;

@@ -602,6 +728,11 @@ static long seccomp_set_mode_filter(unsigned int flags,
out:
unlock_task_sighand(current, &irqflags);
out_free:
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+ write_unlock_irqrestore(&tasklist_lock, taskflags);
+ else
+ __release(&tasklist_lock); /* keep sparse happy */
+
seccomp_filter_free(prepared);
return ret;
}
--
1.7.9.5

2014-06-23 22:02:05

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 3/9] seccomp: introduce writer locking

Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to
the seccomp fields are now protected by the sighand spinlock (which
is unique to the thread group). Read access remains lockless because
pointer updates themselves are atomic. However, writes (or cloning)
often entail additional checking (like maximum instruction counts)
which require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the tasklist_lock. Then parent and child are certain have
the same seccomp state when they exit the lock.

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook <[email protected]>
---
include/linux/seccomp.h | 6 +++---
kernel/fork.c | 40 ++++++++++++++++++++++++++++++++++++----
kernel/seccomp.c | 23 ++++++++++++++++-------
3 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4054b0994071..9ff98b4bfe2e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -14,11 +14,11 @@ struct seccomp_filter;
*
* @mode: indicates one of the valid values above for controlled
* system calls available to a process.
- * @filter: The metadata and ruleset for determining what system calls
- * are allowed for a task.
+ * @filter: must always point to a valid seccomp-filter or NULL as it is
+ * accessed without locking during system call entry.
*
* @filter must only be accessed from the context of current as there
- * is no locking.
+ * is no read locking.
*/
struct seccomp {
int mode;
diff --git a/kernel/fork.c b/kernel/fork.c
index d2799d1fc952..6b2a9add1079 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -315,6 +315,15 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
goto free_ti;

tsk->stack = ti;
+#ifdef CONFIG_SECCOMP
+ /*
+ * We must handle setting up seccomp filters once we're under
+ * the tasklist_lock in case orig has changed between now and
+ * then. Until then, filter must be NULL to avoid messing up
+ * the usage counts on the error path calling free_task.
+ */
+ tsk->seccomp.filter = NULL;
+#endif

setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
@@ -1081,6 +1090,23 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
return 0;
}

+static void copy_seccomp(struct task_struct *p)
+{
+#ifdef CONFIG_SECCOMP
+ /*
+ * Must be called with sighand->lock held. Child lock not needed
+ * since it is not yet in tasklist.
+ */
+ BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+ get_seccomp_filter(current);
+ p->seccomp = current->seccomp;
+
+ if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
+ set_tsk_thread_flag(p, TIF_SECCOMP);
+#endif
+}
+
SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
{
current->clear_child_tid = tidptr;
@@ -1142,6 +1168,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
{
int retval;
struct task_struct *p;
+ unsigned long irqflags;

if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1196,7 +1223,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;

ftrace_graph_init_task(p);
- get_seccomp_filter(p);

rt_mutex_init_task(p);

@@ -1434,7 +1460,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->parent_exec_id = current->self_exec_id;
}

- spin_lock(&current->sighand->siglock);
+ spin_lock_irqsave(&current->sighand->siglock, irqflags);
+
+ /*
+ * Copy seccomp details explicitly here, in case they were changed
+ * before holding tasklist_lock.
+ */
+ copy_seccomp(p);

/*
* Process group and session signals need to be delivered to just the
@@ -1446,7 +1478,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
*/
recalc_sigpending();
if (signal_pending(current)) {
- spin_unlock(&current->sighand->siglock);
+ spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
write_unlock_irq(&tasklist_lock);
retval = -ERESTARTNOINTR;
goto bad_fork_free_pid;
@@ -1486,7 +1518,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
}

total_forks++;
- spin_unlock(&current->sighand->siglock);
+ spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
write_unlock_irq(&tasklist_lock);
proc_fork_connector(p);
cgroup_post_fork(p);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index edc8c79ed16d..065ff5137e39 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -172,12 +172,12 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
*/
static u32 seccomp_run_filters(int syscall)
{
- struct seccomp_filter *f;
+ struct seccomp_filter *f = smp_load_acquire(&current->seccomp.filter);
struct seccomp_data sd;
u32 ret = SECCOMP_RET_ALLOW;

/* Ensure unexpected behavior doesn't result in failing open. */
- if (WARN_ON(current->seccomp.filter == NULL))
+ if (WARN_ON(f == NULL))
return SECCOMP_RET_KILL;

populate_seccomp_data(&sd);
@@ -186,9 +186,8 @@ static u32 seccomp_run_filters(int syscall)
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/
- for (f = current->seccomp.filter; f; f = f->prev) {
+ for (; f; f = smp_load_acquire(&f->prev)) {
u32 cur_ret = SK_RUN_FILTER(f->prog, (void *)&sd);
-
if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
ret = cur_ret;
}
@@ -312,6 +311,8 @@ out:
* seccomp_attach_filter: validate and attach filter
* @filter: seccomp filter to add to the current process
*
+ * Caller must be holding current->sighand->siglock lock.
+ *
* Returns 0 on success, -ve on error.
*/
static long seccomp_attach_filter(struct seccomp_filter *filter)
@@ -319,6 +320,8 @@ static long seccomp_attach_filter(struct seccomp_filter *filter)
unsigned long total_insns;
struct seccomp_filter *walker;

+ BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
/* Validate resulting filter length. */
total_insns = filter->prog->len;
for (walker = current->seccomp.filter; walker; walker = walker->prev)
@@ -331,7 +334,7 @@ static long seccomp_attach_filter(struct seccomp_filter *filter)
* task reference.
*/
filter->prev = current->seccomp.filter;
- current->seccomp.filter = filter;
+ smp_store_release(&current->seccomp.filter, filter);

return 0;
}
@@ -339,7 +342,7 @@ static long seccomp_attach_filter(struct seccomp_filter *filter)
/* get_seccomp_filter - increments the reference count of the filter on @tsk */
void get_seccomp_filter(struct task_struct *tsk)
{
- struct seccomp_filter *orig = tsk->seccomp.filter;
+ struct seccomp_filter *orig = smp_load_acquire(&tsk->seccomp.filter);
if (!orig)
return;
/* Reference count is bounded by the number of total processes. */
@@ -361,7 +364,7 @@ void put_seccomp_filter(struct task_struct *tsk)
/* Clean up single-reference branches iteratively. */
while (orig && atomic_dec_and_test(&orig->usage)) {
struct seccomp_filter *freeme = orig;
- orig = orig->prev;
+ orig = smp_load_acquire(&orig->prev);
seccomp_filter_free(freeme);
}
}
@@ -513,6 +516,7 @@ long prctl_get_seccomp(void)
static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
{
struct seccomp_filter *prepared = NULL;
+ unsigned long irqflags;
long ret = -EINVAL;

#ifdef CONFIG_SECCOMP_FILTER
@@ -524,6 +528,9 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
}
#endif

+ if (unlikely(!lock_task_sighand(current, &irqflags)))
+ goto out_free;
+
if (current->seccomp.mode &&
current->seccomp.mode != seccomp_mode)
goto out;
@@ -551,6 +558,8 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
current->seccomp.mode = seccomp_mode;
set_thread_flag(TIF_SECCOMP);
out:
+ unlock_task_sighand(current, &irqflags);
+out_free:
seccomp_filter_free(prepared);
return ret;
}
--
1.7.9.5

2014-06-23 22:02:03

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 9/9] MIPS: add seccomp syscall

Wires up the new seccomp syscall.

Signed-off-by: Kees Cook <[email protected]>
---
arch/mips/include/uapi/asm/unistd.h | 15 +++++++++------
arch/mips/kernel/scall32-o32.S | 1 +
arch/mips/kernel/scall64-64.S | 1 +
arch/mips/kernel/scall64-n32.S | 1 +
arch/mips/kernel/scall64-o32.S | 1 +
5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/mips/include/uapi/asm/unistd.h b/arch/mips/include/uapi/asm/unistd.h
index 5805414777e0..9bc13eaf9d67 100644
--- a/arch/mips/include/uapi/asm/unistd.h
+++ b/arch/mips/include/uapi/asm/unistd.h
@@ -372,16 +372,17 @@
#define __NR_sched_setattr (__NR_Linux + 349)
#define __NR_sched_getattr (__NR_Linux + 350)
#define __NR_renameat2 (__NR_Linux + 351)
+#define __NR_seccomp (__NR_Linux + 352)

/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 351
+#define __NR_Linux_syscalls 352

#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */

#define __NR_O32_Linux 4000
-#define __NR_O32_Linux_syscalls 351
+#define __NR_O32_Linux_syscalls 352

#if _MIPS_SIM == _MIPS_SIM_ABI64

@@ -701,16 +702,17 @@
#define __NR_sched_setattr (__NR_Linux + 309)
#define __NR_sched_getattr (__NR_Linux + 310)
#define __NR_renameat2 (__NR_Linux + 311)
+#define __NR_seccomp (__NR_Linux + 312)

/*
* Offset of the last Linux 64-bit flavoured syscall
*/
-#define __NR_Linux_syscalls 311
+#define __NR_Linux_syscalls 312

#endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */

#define __NR_64_Linux 5000
-#define __NR_64_Linux_syscalls 311
+#define __NR_64_Linux_syscalls 312

#if _MIPS_SIM == _MIPS_SIM_NABI32

@@ -1034,15 +1036,16 @@
#define __NR_sched_setattr (__NR_Linux + 313)
#define __NR_sched_getattr (__NR_Linux + 314)
#define __NR_renameat2 (__NR_Linux + 315)
+#define __NR_seccomp (__NR_Linux + 316)

/*
* Offset of the last N32 flavoured syscall
*/
-#define __NR_Linux_syscalls 315
+#define __NR_Linux_syscalls 316

#endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */

#define __NR_N32_Linux 6000
-#define __NR_N32_Linux_syscalls 315
+#define __NR_N32_Linux_syscalls 316

#endif /* _UAPI_ASM_UNISTD_H */
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 3245474f19d5..ab02d14f1b5c 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -578,3 +578,4 @@ EXPORT(sys_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr /* 4350 */
PTR sys_renameat2
+ PTR sys_seccomp
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index be2fedd4ae33..010dccf128ec 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -431,4 +431,5 @@ EXPORT(sys_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr /* 5310 */
PTR sys_renameat2
+ PTR sys_seccomp
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index c1dbcda4b816..c3b3b6525df5 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -424,4 +424,5 @@ EXPORT(sysn32_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr
PTR sys_renameat2 /* 6315 */
+ PTR sys_seccomp
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index f1343ccd7ed7..bb1550b1f501 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -557,4 +557,5 @@ EXPORT(sys32_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr /* 4350 */
PTR sys_renameat2
+ PTR sys_seccomp
.size sys32_call_table,.-sys32_call_table
--
1.7.9.5

2014-06-23 23:29:55

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7 1/1] man-pages: seccomp.2: document syscall

Combines documentation from prctl, and in-kernel seccomp_filter.txt,
along with new details specific to the new syscall.

Signed-off-by: Kees Cook <[email protected]>
---
man2/seccomp.2 | 333 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 333 insertions(+)
create mode 100644 man2/seccomp.2

diff --git a/man2/seccomp.2 b/man2/seccomp.2
new file mode 100644
index 0000000..de7fbf7
--- /dev/null
+++ b/man2/seccomp.2
@@ -0,0 +1,333 @@
+.\" Copyright (C) 2014 Kees Cook <[email protected]>
+.\" and Copyright (C) 2012 Will Drewry <[email protected]>
+.\" and Copyright (C) 2008 Michael Kerrisk <[email protected]>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
+.SH NAME
+seccomp \-
+operate on Secure Computing state of the process
+.SH SYNOPSIS
+.nf
+.B #include <linux/seccomp.h>
+.B #include <linux/filter.h>
+.B #include <linux/audit.h>
+.B #include <linux/signal.h>
+.B #include <sys/ptrace.h>
+
+.BI "int seccomp(unsigned int " operation ", unsigned int " flags ,
+.BI " unsigned char *" args );
+.fi
+.SH DESCRIPTION
+The
+.BR seccomp ()
+system call operates on the Secure Computing (seccomp) state of the
+current process.
+
+Currently, Linux supports the following
+.IR operation
+values:
+.TP
+.BR SECCOMP_SET_MODE_STRICT
+Only system calls that the thread is permitted to make are
+.BR read (2),
+.BR write (2),
+.BR _exit (2),
+and
+.BR sigreturn (2).
+Other system calls result in the delivery of a
+.BR SIGKILL
+signal. Strict secure computing mode is useful for number-crunching
+applications that may need to execute untrusted byte code, perhaps
+obtained by reading from a pipe or socket.
+
+This operation is available only if the kernel is configured with
+.BR CONFIG_SECCOMP
+enabled.
+
+The value of
+.IR flags
+must be 0, and
+.IR args
+must be NULL.
+
+This operation is functionally identical to calling
+.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_STRICT)" .
+.TP
+.BR SECCOMP_SET_MODE_FILTER
+The system calls allowed are defined by a pointer to a Berkeley Packet
+Filter (BPF) passed via
+.IR args .
+This argument is a pointer to
+.IR "struct\ sock_fprog" ;
+it can be designed to filter arbitrary system calls and system call
+arguments. If the filter is invalid, the call will fail, returning
+.BR EACCESS
+in
+.IR errno .
+
+If
+.BR fork (2),
+.BR clone (2),
+or
+.BR execve (2)
+are allowed by the filter, any child processes will be constrained to
+the same filters and system calls as the parent.
+
+Prior to using this operation, the process must call
+.IR "prctl(PR_SET_NO_NEW_PRIVS,\ 1)"
+or run with
+.BR CAP_SYS_ADMIN
+privileges in its namespace. If these are not true, the call will fail
+and return
+.BR EACCES
+in
+.IR errno .
+This requirement ensures that filter programs cannot be applied to child
+processes with greater privileges than the process that installed them.
+
+Additionally, if
+.BR prctl (2)
+or
+.BR seccomp (2)
+is allowed by the attached filter, additional filters may be layered on
+which will increase evaluation time, but allow for further reduction of
+the attack surface during execution of a process.
+
+This operation is available only if the kernel is configured with
+.BR CONFIG_SECCOMP_FILTER
+enabled.
+
+When
+.IR flags
+are 0, this operation is functionally identical to calling
+.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_FILTER,\ args)" .
+
+The recognized
+.IR flags
+are:
+.RS
+.TP
+.BR SECCOMP_FILTER_FLAG_TSYNC
+When adding a new filter, synchronize all other threads of the current
+process to the same seccomp filter tree. If any thread cannot do this,
+the call will not attach the new seccomp filter, and will fail returning
+the first thread ID found that cannot synchronize. Synchronization will
+fail if another thread is in
+.BR SECCOMP_MODE_STRICT
+or if it has attached new seccomp filters to itself, diverging from the
+calling thread's filter tree.
+.RE
+.SH FILTERS
+When adding filters via
+.BR SECCOMP_SET_MODE_FILTER ,
+.IR args
+points to a filter program:
+
+.in +4n
+.nf
+struct sock_fprog {
+ unsigned short len; /* Number of filter blocks */
+ struct sock_filter *filter;
+};
+.fi
+.in
+
+Each program must contain one or more BPF instructions:
+
+.in +4n
+.nf
+struct sock_filter { /* Filter block */
+ __u16 code; /* Actual filter code */
+ __u8 jt; /* Jump true */
+ __u8 jf; /* Jump false */
+ __u32 k; /* Generic multiuse field */
+};
+.fi
+.in
+
+When executing the instructions, the BPF program executes over the
+syscall information made available via:
+
+.in +4n
+.nf
+struct seccomp_data {
+ int nr; /* system call number */
+ __u32 arch; /* AUDIT_ARCH_* value */
+ __u64 instruction_pointer; /* CPU instruction pointer */
+ __u64 args[6]; /* up to 6 system call arguments */
+};
+.fi
+.in
+
+A seccomp filter may return any of the following values. If multiple
+filters exist, the return value for the evaluation of a given system
+call will always use the highest precedent value. (For example,
+.BR SECCOMP_RET_KILL
+will always take precedence.)
+
+In precedence order, they are:
+.TP
+.BR SECCOMP_RET_KILL
+Results in the task exiting immediately without executing the
+system call. The exit status of the task (status & 0x7f) will
+be
+.BR SIGSYS ,
+not
+.BR SIGKILL .
+.TP
+.BR SECCOMP_RET_TRAP
+Results in the kernel sending a
+.BR SIGSYS
+signal to the triggering task without executing the system call.
+.IR siginfo\->si_call_addr
+will show the address of the system call instruction, and
+.IR siginfo\->si_syscall
+and
+.IR siginfo\->si_arch
+will indicate which syscall was attempted. The program counter will be
+as though the syscall happened (i.e. it will not point to the syscall
+instruction). The return value register will contain an arch\-dependent
+value; if resuming execution, set it to something sensible.
+(The architecture dependency is because replacing it with
+.BR ENOSYS
+could overwrite some useful information.)
+
+The
+.BR SECCOMP_RET_DATA
+portion of the return value will be passed as
+.IR si_errno .
+
+.BR SIGSYS
+triggered by seccomp will have a
+.IR si_code
+of
+.BR SYS_SECCOMP .
+.TP
+.BR SECCOMP_RET_ERRNO
+Results in the lower 16-bits of the return value being passed
+to userland as the
+.IR errno
+without executing the system call.
+.TP
+.BR SECCOMP_RET_TRACE
+When returned, this value will cause the kernel to attempt to
+notify a ptrace()-based tracer prior to executing the system
+call. If there is no tracer present,
+.BR ENOSYS
+is returned to userland and the system call is not executed.
+
+A tracer will be notified if it requests
+.BR PTRACE_O_TRACESECCOMP
+using
+.IR ptrace(PTRACE_SETOPTIONS) .
+The tracer will be notified of a
+.BR PTRACE_EVENT_SECCOMP
+and the
+.BR SECCOMP_RET_DATA
+portion of the BPF program return value will be available to the tracer
+via
+.BR PTRACE_GETEVENTMSG .
+
+The tracer can skip the system call by changing the syscall number
+to \-1. Alternatively, the tracer can change the system call
+requested by changing the system call to a valid syscall number. If
+the tracer asks to skip the system call, then the system call will
+appear to return the value that the tracer puts in the return value
+register.
+
+The seccomp check will not be run again after the tracer is
+notified. (This means that seccomp-based sandboxes MUST NOT
+allow use of ptrace, even of other sandboxed processes, without
+extreme care; ptracers can use this mechanism to escape.)
+.TP
+.BR SECCOMP_RET_ALLOW
+Results in the system call being executed.
+
+If multiple filters exist, the return value for the evaluation of a
+given system call will always use the highest precedent value.
+
+Precedence is only determined using the
+.BR SECCOMP_RET_ACTION
+mask. When multiple filters return values of the same precedence,
+only the
+.BR SECCOMP_RET_DATA
+from the most recently installed filter will be returned.
+.SH RETURN VALUE
+On success,
+.BR seccomp ()
+returns 0.
+On error, if
+.BR SECCOMP_FILTER_FLAG_TSYNC
+was used, the return value is the thread ID that caused the
+synchronization failure. On other errors, \-1 is returned, and
+.IR errno
+is set to indicate the cause of the error.
+.SH ERRORS
+.BR seccomp ()
+can fail for the following reasons:
+.TP
+.BR EACCESS
+the caller did not have the
+.BR CAP_SYS_ADMIN
+capability, or had not set
+.IR no_new_privs
+before using
+.BR SECCOMP_SET_MODE_FILTER .
+.TP
+.BR EFAULT
+.IR args
+was required to be a valid address.
+.TP
+.BR EINVAL
+.IR operation
+is unknown; or
+.IR flags
+are invalid for the given
+.IR operation
+.TP
+.BR ESRCH
+Another thread caused a failure during thread sync, but its ID could not
+be determined.
+.SH VERSIONS
+This system call first appeared in Linux 3.16.
+.\" FIXME Add glibc version
+.SH CONFORMING TO
+This system call is a nonstandard Linux extensions.
+.SH NOTES
+.BR seccomp ()
+provides a superset of the functionality provided by
+.IR PR_SET_SECCOMP
+of
+.BR prctl (2) .
+(Which does not support
+.IR flags .)
+.SH SEE ALSO
+.ad l
+.nh
+.BR prctl (2),
+.BR ptrace (2),
+.BR signal (7)
+.BR socket (7)
+.ad
--
1.7.9.5


--
Kees Cook @outflux.net

Subject: Re: [PATCH v7 1/1] man-pages: seccomp.2: document syscall

On 06/24/2014 12:01 AM, Kees Cook wrote:
> Combines documentation from prctl, and in-kernel seccomp_filter.txt,
> along with new details specific to the new syscall.

Great work on the man page, Kees! (BTW, just looking at the complexity detailed
there further supports the decision to grant this functionality as a separate
syscall, rather than multiplexed into prctl(2).

Would there be some suitable, not too long program that we
could put in the man page as an example for using filters?

Cheers,

Michael


> Signed-off-by: Kees Cook <[email protected]>
> ---
> man2/seccomp.2 | 333 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 333 insertions(+)
> create mode 100644 man2/seccomp.2
>
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> new file mode 100644
> index 0000000..de7fbf7
> --- /dev/null
> +++ b/man2/seccomp.2
> @@ -0,0 +1,333 @@
> +.\" Copyright (C) 2014 Kees Cook <[email protected]>
> +.\" and Copyright (C) 2012 Will Drewry <[email protected]>
> +.\" and Copyright (C) 2008 Michael Kerrisk <[email protected]>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +seccomp \-
> +operate on Secure Computing state of the process
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/seccomp.h>
> +.B #include <linux/filter.h>
> +.B #include <linux/audit.h>
> +.B #include <linux/signal.h>
> +.B #include <sys/ptrace.h>
> +
> +.BI "int seccomp(unsigned int " operation ", unsigned int " flags ,
> +.BI " unsigned char *" args );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR seccomp ()
> +system call operates on the Secure Computing (seccomp) state of the
> +current process.
> +
> +Currently, Linux supports the following
> +.IR operation
> +values:
> +.TP
> +.BR SECCOMP_SET_MODE_STRICT
> +Only system calls that the thread is permitted to make are
> +.BR read (2),
> +.BR write (2),
> +.BR _exit (2),
> +and
> +.BR sigreturn (2).
> +Other system calls result in the delivery of a
> +.BR SIGKILL
> +signal. Strict secure computing mode is useful for number-crunching
> +applications that may need to execute untrusted byte code, perhaps
> +obtained by reading from a pipe or socket.
> +
> +This operation is available only if the kernel is configured with
> +.BR CONFIG_SECCOMP
> +enabled.
> +
> +The value of
> +.IR flags
> +must be 0, and
> +.IR args
> +must be NULL.
> +
> +This operation is functionally identical to calling
> +.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_STRICT)" .
> +.TP
> +.BR SECCOMP_SET_MODE_FILTER
> +The system calls allowed are defined by a pointer to a Berkeley Packet
> +Filter (BPF) passed via
> +.IR args .
> +This argument is a pointer to
> +.IR "struct\ sock_fprog" ;
> +it can be designed to filter arbitrary system calls and system call
> +arguments. If the filter is invalid, the call will fail, returning
> +.BR EACCESS
> +in
> +.IR errno .
> +
> +If
> +.BR fork (2),
> +.BR clone (2),
> +or
> +.BR execve (2)
> +are allowed by the filter, any child processes will be constrained to
> +the same filters and system calls as the parent.
> +
> +Prior to using this operation, the process must call
> +.IR "prctl(PR_SET_NO_NEW_PRIVS,\ 1)"
> +or run with
> +.BR CAP_SYS_ADMIN
> +privileges in its namespace. If these are not true, the call will fail
> +and return
> +.BR EACCES
> +in
> +.IR errno .
> +This requirement ensures that filter programs cannot be applied to child
> +processes with greater privileges than the process that installed them.
> +
> +Additionally, if
> +.BR prctl (2)
> +or
> +.BR seccomp (2)
> +is allowed by the attached filter, additional filters may be layered on
> +which will increase evaluation time, but allow for further reduction of
> +the attack surface during execution of a process.
> +
> +This operation is available only if the kernel is configured with
> +.BR CONFIG_SECCOMP_FILTER
> +enabled.
> +
> +When
> +.IR flags
> +are 0, this operation is functionally identical to calling
> +.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_FILTER,\ args)" .
> +
> +The recognized
> +.IR flags
> +are:
> +.RS
> +.TP
> +.BR SECCOMP_FILTER_FLAG_TSYNC
> +When adding a new filter, synchronize all other threads of the current
> +process to the same seccomp filter tree. If any thread cannot do this,
> +the call will not attach the new seccomp filter, and will fail returning
> +the first thread ID found that cannot synchronize. Synchronization will
> +fail if another thread is in
> +.BR SECCOMP_MODE_STRICT
> +or if it has attached new seccomp filters to itself, diverging from the
> +calling thread's filter tree.
> +.RE
> +.SH FILTERS
> +When adding filters via
> +.BR SECCOMP_SET_MODE_FILTER ,
> +.IR args
> +points to a filter program:
> +
> +.in +4n
> +.nf
> +struct sock_fprog {
> + unsigned short len; /* Number of filter blocks */
> + struct sock_filter *filter;
> +};
> +.fi
> +.in
> +
> +Each program must contain one or more BPF instructions:
> +
> +.in +4n
> +.nf
> +struct sock_filter { /* Filter block */
> + __u16 code; /* Actual filter code */
> + __u8 jt; /* Jump true */
> + __u8 jf; /* Jump false */
> + __u32 k; /* Generic multiuse field */
> +};
> +.fi
> +.in
> +
> +When executing the instructions, the BPF program executes over the
> +syscall information made available via:
> +
> +.in +4n
> +.nf
> +struct seccomp_data {
> + int nr; /* system call number */
> + __u32 arch; /* AUDIT_ARCH_* value */
> + __u64 instruction_pointer; /* CPU instruction pointer */
> + __u64 args[6]; /* up to 6 system call arguments */
> +};
> +.fi
> +.in
> +
> +A seccomp filter may return any of the following values. If multiple
> +filters exist, the return value for the evaluation of a given system
> +call will always use the highest precedent value. (For example,
> +.BR SECCOMP_RET_KILL
> +will always take precedence.)
> +
> +In precedence order, they are:
> +.TP
> +.BR SECCOMP_RET_KILL
> +Results in the task exiting immediately without executing the
> +system call. The exit status of the task (status & 0x7f) will
> +be
> +.BR SIGSYS ,
> +not
> +.BR SIGKILL .
> +.TP
> +.BR SECCOMP_RET_TRAP
> +Results in the kernel sending a
> +.BR SIGSYS
> +signal to the triggering task without executing the system call.
> +.IR siginfo\->si_call_addr
> +will show the address of the system call instruction, and
> +.IR siginfo\->si_syscall
> +and
> +.IR siginfo\->si_arch
> +will indicate which syscall was attempted. The program counter will be
> +as though the syscall happened (i.e. it will not point to the syscall
> +instruction). The return value register will contain an arch\-dependent
> +value; if resuming execution, set it to something sensible.
> +(The architecture dependency is because replacing it with
> +.BR ENOSYS
> +could overwrite some useful information.)
> +
> +The
> +.BR SECCOMP_RET_DATA
> +portion of the return value will be passed as
> +.IR si_errno .
> +
> +.BR SIGSYS
> +triggered by seccomp will have a
> +.IR si_code
> +of
> +.BR SYS_SECCOMP .
> +.TP
> +.BR SECCOMP_RET_ERRNO
> +Results in the lower 16-bits of the return value being passed
> +to userland as the
> +.IR errno
> +without executing the system call.
> +.TP
> +.BR SECCOMP_RET_TRACE
> +When returned, this value will cause the kernel to attempt to
> +notify a ptrace()-based tracer prior to executing the system
> +call. If there is no tracer present,
> +.BR ENOSYS
> +is returned to userland and the system call is not executed.
> +
> +A tracer will be notified if it requests
> +.BR PTRACE_O_TRACESECCOMP
> +using
> +.IR ptrace(PTRACE_SETOPTIONS) .
> +The tracer will be notified of a
> +.BR PTRACE_EVENT_SECCOMP
> +and the
> +.BR SECCOMP_RET_DATA
> +portion of the BPF program return value will be available to the tracer
> +via
> +.BR PTRACE_GETEVENTMSG .
> +
> +The tracer can skip the system call by changing the syscall number
> +to \-1. Alternatively, the tracer can change the system call
> +requested by changing the system call to a valid syscall number. If
> +the tracer asks to skip the system call, then the system call will
> +appear to return the value that the tracer puts in the return value
> +register.
> +
> +The seccomp check will not be run again after the tracer is
> +notified. (This means that seccomp-based sandboxes MUST NOT
> +allow use of ptrace, even of other sandboxed processes, without
> +extreme care; ptracers can use this mechanism to escape.)
> +.TP
> +.BR SECCOMP_RET_ALLOW
> +Results in the system call being executed.
> +
> +If multiple filters exist, the return value for the evaluation of a
> +given system call will always use the highest precedent value.
> +
> +Precedence is only determined using the
> +.BR SECCOMP_RET_ACTION
> +mask. When multiple filters return values of the same precedence,
> +only the
> +.BR SECCOMP_RET_DATA
> +from the most recently installed filter will be returned.
> +.SH RETURN VALUE
> +On success,
> +.BR seccomp ()
> +returns 0.
> +On error, if
> +.BR SECCOMP_FILTER_FLAG_TSYNC
> +was used, the return value is the thread ID that caused the
> +synchronization failure. On other errors, \-1 is returned, and
> +.IR errno
> +is set to indicate the cause of the error.
> +.SH ERRORS
> +.BR seccomp ()
> +can fail for the following reasons:
> +.TP
> +.BR EACCESS
> +the caller did not have the
> +.BR CAP_SYS_ADMIN
> +capability, or had not set
> +.IR no_new_privs
> +before using
> +.BR SECCOMP_SET_MODE_FILTER .
> +.TP
> +.BR EFAULT
> +.IR args
> +was required to be a valid address.
> +.TP
> +.BR EINVAL
> +.IR operation
> +is unknown; or
> +.IR flags
> +are invalid for the given
> +.IR operation
> +.TP
> +.BR ESRCH
> +Another thread caused a failure during thread sync, but its ID could not
> +be determined.
> +.SH VERSIONS
> +This system call first appeared in Linux 3.16.
> +.\" FIXME Add glibc version
> +.SH CONFORMING TO
> +This system call is a nonstandard Linux extensions.
> +.SH NOTES
> +.BR seccomp ()
> +provides a superset of the functionality provided by
> +.IR PR_SET_SECCOMP
> +of
> +.BR prctl (2) .
> +(Which does not support
> +.IR flags .)
> +.SH SEE ALSO
> +.ad l
> +.nh
> +.BR prctl (2),
> +.BR ptrace (2),
> +.BR signal (7)
> +.BR socket (7)
> +.ad
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-06-24 16:44:04

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 1/1] man-pages: seccomp.2: document syscall

On Tue, Jun 24, 2014 at 3:23 AM, Michael Kerrisk (man-pages)
<[email protected]> wrote:
> On 06/24/2014 12:01 AM, Kees Cook wrote:
>> Combines documentation from prctl, and in-kernel seccomp_filter.txt,
>> along with new details specific to the new syscall.
>
> Great work on the man page, Kees! (BTW, just looking at the complexity detailed
> there further supports the decision to grant this functionality as a separate
> syscall, rather than multiplexed into prctl(2).

Great, thanks!

> Would there be some suitable, not too long program that we
> could put in the man page as an example for using filters?

Sure thing. I can modify the "dropper" sample in samples/seccomp. I
will resend the man-page with that added.

Thanks!

-Kees

--
Kees Cook
Chrome OS Security

2014-06-24 16:54:22

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v7 3/9] seccomp: introduce writer locking

Kees,

I am still trying to force myself to read and try to understand what
this series does ;) Just a minor nit so far.

On 06/23, Kees Cook wrote:
>
> @@ -1142,6 +1168,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> {
> int retval;
> struct task_struct *p;
> + unsigned long irqflags;
>
> if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> return ERR_PTR(-EINVAL);
> @@ -1196,7 +1223,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> goto fork_out;
>
> ftrace_graph_init_task(p);
> - get_seccomp_filter(p);
>
> rt_mutex_init_task(p);
>
> @@ -1434,7 +1460,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> p->parent_exec_id = current->self_exec_id;
> }
>
> - spin_lock(&current->sighand->siglock);
> + spin_lock_irqsave(&current->sighand->siglock, irqflags);
> +
> + /*
> + * Copy seccomp details explicitly here, in case they were changed
> + * before holding tasklist_lock.
> + */
> + copy_seccomp(p);
>
> /*
> * Process group and session signals need to be delivered to just the
> @@ -1446,7 +1478,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> */
> recalc_sigpending();
> if (signal_pending(current)) {
> - spin_unlock(&current->sighand->siglock);
> + spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
> write_unlock_irq(&tasklist_lock);
> retval = -ERESTARTNOINTR;
> goto bad_fork_free_pid;
> @@ -1486,7 +1518,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> }
>
> total_forks++;
> - spin_unlock(&current->sighand->siglock);
> + spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
> write_unlock_irq(&tasklist_lock);
> proc_fork_connector(p);
> cgroup_post_fork(p);

It seems that the only change copy_process() needs is copy_seccomp() under the locks.
Everythinh else (irqflags games) looks obviously unneeded?

> @@ -524,6 +528,9 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
> }
> #endif
>
> + if (unlikely(!lock_task_sighand(current, &irqflags)))
> + goto out_free;
> +

Unless this task is exiting (namely, it has already called exit_notify()),
lock_task_sighand(current) must not fail. Looks like you can simly do
spin_lock_irq(->siglock).

Oleg.

2014-06-24 17:30:53

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v7 7/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On 06/23, Kees Cook wrote:
>
> +static pid_t seccomp_can_sync_threads(void)
> +{
> + struct task_struct *thread, *caller;
> +
> + BUG_ON(write_can_lock(&tasklist_lock));
> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
> +
> + if (current->seccomp.mode != SECCOMP_MODE_FILTER)
> + return -EACCES;
> +
> + /* Validate all threads being eligible for synchronization. */
> + thread = caller = current;
> + for_each_thread(caller, thread) {
> + pid_t failed;
> +
> + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
> + (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
> + is_ancestor(thread->seccomp.filter,
> + caller->seccomp.filter)))
> + continue;
> +
> + /* Return the first thread that cannot be synchronized. */
> + failed = task_pid_vnr(thread);
> + /* If the pid cannot be resolved, then return -ESRCH */
> + if (failed == 0)
> + failed = -ESRCH;

forgot to mention, task_pid_vnr() can't fail. sighand->siglock is held,
for_each_thread() can't see a thread which has passed unhash_process().

Oleg.

2014-06-24 17:49:25

by Kees Cook

[permalink] [raw]
Subject: [PATCH v7.1 1/1] man-pages: seccomp.2: document syscall

Combines documentation from prctl, in-kernel seccomp_filter.txt and
dropper.c, along with details specific to the new syscall.

Signed-off-by: Kees Cook <[email protected]>
---
v2:
- add full example code, based on "dropper.c" in samples/seccomp/
---
man2/seccomp.2 | 400 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 400 insertions(+)
create mode 100644 man2/seccomp.2

diff --git a/man2/seccomp.2 b/man2/seccomp.2
new file mode 100644
index 0000000..80d75c9
--- /dev/null
+++ b/man2/seccomp.2
@@ -0,0 +1,400 @@
+.\" Copyright (C) 2014 Kees Cook <[email protected]>
+.\" and Copyright (C) 2012 Will Drewry <[email protected]>
+.\" and Copyright (C) 2008 Michael Kerrisk <[email protected]>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
+.SH NAME
+seccomp \-
+operate on Secure Computing state of the process
+.SH SYNOPSIS
+.nf
+.B #include <linux/seccomp.h>
+.B #include <linux/filter.h>
+.B #include <linux/audit.h>
+.B #include <linux/signal.h>
+.B #include <sys/ptrace.h>
+
+.BI "int seccomp(unsigned int " operation ", unsigned int " flags ,
+.BI " unsigned char *" args );
+.fi
+.SH DESCRIPTION
+The
+.BR seccomp ()
+system call operates on the Secure Computing (seccomp) state of the
+current process.
+
+Currently, Linux supports the following
+.IR operation
+values:
+.TP
+.BR SECCOMP_SET_MODE_STRICT
+Only system calls that the thread is permitted to make are
+.BR read (2),
+.BR write (2),
+.BR _exit (2),
+and
+.BR sigreturn (2).
+Other system calls result in the delivery of a
+.BR SIGKILL
+signal. Strict secure computing mode is useful for number-crunching
+applications that may need to execute untrusted byte code, perhaps
+obtained by reading from a pipe or socket.
+
+This operation is available only if the kernel is configured with
+.BR CONFIG_SECCOMP
+enabled.
+
+The value of
+.IR flags
+must be 0, and
+.IR args
+must be NULL.
+
+This operation is functionally identical to calling
+.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_STRICT)" .
+.TP
+.BR SECCOMP_SET_MODE_FILTER
+The system calls allowed are defined by a pointer to a Berkeley Packet
+Filter (BPF) passed via
+.IR args .
+This argument is a pointer to
+.IR "struct\ sock_fprog" ;
+it can be designed to filter arbitrary system calls and system call
+arguments. If the filter is invalid, the call will fail, returning
+.BR EACCESS
+in
+.IR errno .
+
+If
+.BR fork (2),
+.BR clone (2),
+or
+.BR execve (2)
+are allowed by the filter, any child processes will be constrained to
+the same filters and system calls as the parent.
+
+Prior to using this operation, the process must call
+.IR "prctl(PR_SET_NO_NEW_PRIVS,\ 1)"
+or run with
+.BR CAP_SYS_ADMIN
+privileges in its namespace. If these are not true, the call will fail
+and return
+.BR EACCES
+in
+.IR errno .
+This requirement ensures that filter programs cannot be applied to child
+processes with greater privileges than the process that installed them.
+
+Additionally, if
+.BR prctl (2)
+or
+.BR seccomp (2)
+is allowed by the attached filter, additional filters may be layered on
+which will increase evaluation time, but allow for further reduction of
+the attack surface during execution of a process.
+
+This operation is available only if the kernel is configured with
+.BR CONFIG_SECCOMP_FILTER
+enabled.
+
+When
+.IR flags
+are 0, this operation is functionally identical to calling
+.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_FILTER,\ args)" .
+
+The recognized
+.IR flags
+are:
+.RS
+.TP
+.BR SECCOMP_FILTER_FLAG_TSYNC
+When adding a new filter, synchronize all other threads of the current
+process to the same seccomp filter tree. If any thread cannot do this,
+the call will not attach the new seccomp filter, and will fail returning
+the first thread ID found that cannot synchronize. Synchronization will
+fail if another thread is in
+.BR SECCOMP_MODE_STRICT
+or if it has attached new seccomp filters to itself, diverging from the
+calling thread's filter tree.
+.RE
+.SH FILTERS
+When adding filters via
+.BR SECCOMP_SET_MODE_FILTER ,
+.IR args
+points to a filter program:
+
+.in +4n
+.nf
+struct sock_fprog {
+ unsigned short len; /* Number of BPF instructions */
+ struct sock_filter *filter;
+};
+.fi
+.in
+
+Each program must contain one or more BPF instructions:
+
+.in +4n
+.nf
+struct sock_filter { /* Filter block */
+ __u16 code; /* Actual filter code */
+ __u8 jt; /* Jump true */
+ __u8 jf; /* Jump false */
+ __u32 k; /* Generic multiuse field */
+};
+.fi
+.in
+
+When executing the instructions, the BPF program executes over the
+syscall information made available via:
+
+.in +4n
+.nf
+struct seccomp_data {
+ int nr; /* system call number */
+ __u32 arch; /* AUDIT_ARCH_* value */
+ __u64 instruction_pointer; /* CPU instruction pointer */
+ __u64 args[6]; /* up to 6 system call arguments */
+};
+.fi
+.in
+
+A seccomp filter may return any of the following values. If multiple
+filters exist, the return value for the evaluation of a given system
+call will always use the highest precedent value. (For example,
+.BR SECCOMP_RET_KILL
+will always take precedence.)
+
+In precedence order, they are:
+.TP
+.BR SECCOMP_RET_KILL
+Results in the task exiting immediately without executing the
+system call. The exit status of the task (status & 0x7f) will
+be
+.BR SIGSYS ,
+not
+.BR SIGKILL .
+.TP
+.BR SECCOMP_RET_TRAP
+Results in the kernel sending a
+.BR SIGSYS
+signal to the triggering task without executing the system call.
+.IR siginfo\->si_call_addr
+will show the address of the system call instruction, and
+.IR siginfo\->si_syscall
+and
+.IR siginfo\->si_arch
+will indicate which syscall was attempted. The program counter will be
+as though the syscall happened (i.e. it will not point to the syscall
+instruction). The return value register will contain an arch\-dependent
+value; if resuming execution, set it to something sensible.
+(The architecture dependency is because replacing it with
+.BR ENOSYS
+could overwrite some useful information.)
+
+The
+.BR SECCOMP_RET_DATA
+portion of the return value will be passed as
+.IR si_errno .
+
+.BR SIGSYS
+triggered by seccomp will have a
+.IR si_code
+of
+.BR SYS_SECCOMP .
+.TP
+.BR SECCOMP_RET_ERRNO
+Results in the lower 16-bits of the return value being passed
+to userland as the
+.IR errno
+without executing the system call.
+.TP
+.BR SECCOMP_RET_TRACE
+When returned, this value will cause the kernel to attempt to
+notify a ptrace()-based tracer prior to executing the system
+call. If there is no tracer present,
+.BR ENOSYS
+is returned to userland and the system call is not executed.
+
+A tracer will be notified if it requests
+.BR PTRACE_O_TRACESECCOMP
+using
+.IR ptrace(PTRACE_SETOPTIONS) .
+The tracer will be notified of a
+.BR PTRACE_EVENT_SECCOMP
+and the
+.BR SECCOMP_RET_DATA
+portion of the BPF program return value will be available to the tracer
+via
+.BR PTRACE_GETEVENTMSG .
+
+The tracer can skip the system call by changing the syscall number
+to \-1. Alternatively, the tracer can change the system call
+requested by changing the system call to a valid syscall number. If
+the tracer asks to skip the system call, then the system call will
+appear to return the value that the tracer puts in the return value
+register.
+
+The seccomp check will not be run again after the tracer is
+notified. (This means that seccomp-based sandboxes MUST NOT
+allow use of ptrace, even of other sandboxed processes, without
+extreme care; ptracers can use this mechanism to escape.)
+.TP
+.BR SECCOMP_RET_ALLOW
+Results in the system call being executed.
+
+If multiple filters exist, the return value for the evaluation of a
+given system call will always use the highest precedent value.
+
+Precedence is only determined using the
+.BR SECCOMP_RET_ACTION
+mask. When multiple filters return values of the same precedence,
+only the
+.BR SECCOMP_RET_DATA
+from the most recently installed filter will be returned.
+.SH RETURN VALUE
+On success,
+.BR seccomp ()
+returns 0.
+On error, if
+.BR SECCOMP_FILTER_FLAG_TSYNC
+was used, the return value is the thread ID that caused the
+synchronization failure. On other errors, \-1 is returned, and
+.IR errno
+is set to indicate the cause of the error.
+.SH ERRORS
+.BR seccomp ()
+can fail for the following reasons:
+.TP
+.BR EACCESS
+the caller did not have the
+.BR CAP_SYS_ADMIN
+capability, or had not set
+.IR no_new_privs
+before using
+.BR SECCOMP_SET_MODE_FILTER .
+.TP
+.BR EFAULT
+.IR args
+was required to be a valid address.
+.TP
+.BR EINVAL
+.IR operation
+is unknown; or
+.IR flags
+are invalid for the given
+.IR operation
+.TP
+.BR ESRCH
+Another thread caused a failure during thread sync, but its ID could not
+be determined.
+.SH VERSIONS
+This system call first appeared in Linux 3.16.
+.\" FIXME Add glibc version
+.SH CONFORMING TO
+This system call is a nonstandard Linux extensions.
+.SH NOTES
+.BR seccomp ()
+provides a superset of the functionality provided by
+.IR PR_SET_SECCOMP
+of
+.BR prctl (2) .
+(Which does not support
+.IR flags .)
+.SH EXAMPLE
+.nf
+#include <errno.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <linux/audit.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <sys/prctl.h>
+
+static int install_filter(int syscall, int arch, int error)
+{
+ struct sock_filter filter[] = {
+ /* Load architecture. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, arch))),
+ /* Jump forward 4 instructions on architecture mismatch. */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, arch, 0, 4),
+ /* Load syscall number. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, nr))),
+ /* Jump forward 1 instruction on syscall mismatch. */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, syscall, 0, 1),
+ /* Matching arch and syscall: return specific errno. */
+ BPF_STMT(BPF_RET+BPF_K,
+ SECCOMP_RET_ERRNO|(error & SECCOMP_RET_DATA)),
+ /* Destination of syscall mismatch: Allow other syscalls. */
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ /* Destination of arch mismatch: Kill process. */
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog)) {
+ perror("seccomp");
+ return EXIT_FAILURE;
+ }
+ return EXIT_SUCCESS;
+}
+
+int main(int argc, char **argv)
+{
+ if (argc < 5) {
+ fprintf(stderr, "Usage:\\n"
+ "refuse <syscall_nr> <arch> <errno> <prog> [<args>]\\n"
+ "Hint: AUDIT_ARCH_I386: 0x%X\\n"
+ " AUDIT_ARCH_X86_64: 0x%X\\n"
+ "\\n", AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
+ return EXIT_FAILURE;
+ }
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl");
+ return EXIT_FAILURE;
+ }
+ if (install_filter(strtol(argv[1], NULL, 0),
+ strtol(argv[2], NULL, 0),
+ strtol(argv[3], NULL, 0)))
+ return EXIT_FAILURE;
+ execv(argv[4], &argv[4]);
+ perror("execv");
+ return EXIT_FAILURE;
+}
+.fi
+.SH SEE ALSO
+.ad l
+.nh
+.BR prctl (2),
+.BR ptrace (2),
+.BR signal (7),
+.BR socket (7)
+.ad
--
1.7.9.5


--
Kees Cook @outflux.net

2014-06-24 17:54:46

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v7 7/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On 06/23, Kees Cook wrote:
>
> +static pid_t seccomp_can_sync_threads(void)
> +{
> + struct task_struct *thread, *caller;
> +
> + BUG_ON(write_can_lock(&tasklist_lock));
> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
> +
> + if (current->seccomp.mode != SECCOMP_MODE_FILTER)
> + return -EACCES;
> +
> + /* Validate all threads being eligible for synchronization. */
> + thread = caller = current;
> + for_each_thread(caller, thread) {

You only need to initialize "caller" for for_each_thread(). Same for
seccomp_sync_threads().

> @@ -586,6 +701,17 @@ static long seccomp_set_mode_filter(unsigned int flags,
> if (IS_ERR(prepared))
> return PTR_ERR(prepared);
>
> + /*
> + * If we're doing thread sync, we must hold tasklist_lock
> + * to make sure seccomp filter changes are stable on threads
> + * entering or leaving the task list. And we must take it
> + * before the sighand lock to avoid deadlocking.
> + */
> + if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> + write_lock_irqsave(&tasklist_lock, taskflags);
> + else
> + __acquire(&tasklist_lock); /* keep sparse happy */
> +

Why? ->siglock should be enough, it seems.

It obviously does not protect the global process list, but *sync_threads()
only care about current's thread group list, no?

Oleg.

2014-06-24 18:02:10

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 3/9] seccomp: introduce writer locking

On Tue, Jun 24, 2014 at 9:52 AM, Oleg Nesterov <[email protected]> wrote:
> Kees,
>
> I am still trying to force myself to read and try to understand what
> this series does ;) Just a minor nit so far.

The use-case this solves is when a userspace process does not control
(or know) when a thread is spawned (e.g. via shared library init, or
LD_PRELOAD) but wants to make sure seccomp filters have been applied
to it. Specifically, Chrome, when loading some proprietary graphics
drivers, will have those drivers spawning threads before there has
been an ability to call seccomp. While some games can be played to get
ahead of it, it's not always possible, or racey, depending on the
drivers. With seccomp thread-sync, we can be certain that all threads
have had the filter applied.

>
> On 06/23, Kees Cook wrote:
>>
>> @@ -1142,6 +1168,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> {
>> int retval;
>> struct task_struct *p;
>> + unsigned long irqflags;
>>
>> if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
>> return ERR_PTR(-EINVAL);
>> @@ -1196,7 +1223,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> goto fork_out;
>>
>> ftrace_graph_init_task(p);
>> - get_seccomp_filter(p);
>>
>> rt_mutex_init_task(p);
>>
>> @@ -1434,7 +1460,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> p->parent_exec_id = current->self_exec_id;
>> }
>>
>> - spin_lock(&current->sighand->siglock);
>> + spin_lock_irqsave(&current->sighand->siglock, irqflags);
>> +
>> + /*
>> + * Copy seccomp details explicitly here, in case they were changed
>> + * before holding tasklist_lock.
>> + */
>> + copy_seccomp(p);
>>
>> /*
>> * Process group and session signals need to be delivered to just the
>> @@ -1446,7 +1478,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> */
>> recalc_sigpending();
>> if (signal_pending(current)) {
>> - spin_unlock(&current->sighand->siglock);
>> + spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
>> write_unlock_irq(&tasklist_lock);
>> retval = -ERESTARTNOINTR;
>> goto bad_fork_free_pid;
>> @@ -1486,7 +1518,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> }
>>
>> total_forks++;
>> - spin_unlock(&current->sighand->siglock);
>> + spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
>> write_unlock_irq(&tasklist_lock);
>> proc_fork_connector(p);
>> cgroup_post_fork(p);
>
> It seems that the only change copy_process() needs is copy_seccomp() under the locks.
> Everythinh else (irqflags games) looks obviously unneeded?

I got irq lock dep warnings without these changes. If they're
unneeded, that's totally fine by me, but some change (either this or
markings to keep lockdep happy) is needed.

>> @@ -524,6 +528,9 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
>> }
>> #endif
>>
>> + if (unlikely(!lock_task_sighand(current, &irqflags)))
>> + goto out_free;
>> +
>
> Unless this task is exiting (namely, it has already called exit_notify()),
> lock_task_sighand(current) must not fail. Looks like you can simply do
> spin_lock_irq(->siglock).

Okay. I wasn't sure if I needed to be extra careful here or not. I
opted for just using lock_task_sighand since that seemed to be the
most used. :)

-Kees

--
Kees Cook
Chrome OS Security

2014-06-24 18:05:26

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 7/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Tue, Jun 24, 2014 at 10:27 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/23, Kees Cook wrote:
>>
>> +static pid_t seccomp_can_sync_threads(void)
>> +{
>> + struct task_struct *thread, *caller;
>> +
>> + BUG_ON(write_can_lock(&tasklist_lock));
>> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
>> +
>> + if (current->seccomp.mode != SECCOMP_MODE_FILTER)
>> + return -EACCES;
>> +
>> + /* Validate all threads being eligible for synchronization. */
>> + thread = caller = current;
>> + for_each_thread(caller, thread) {
>> + pid_t failed;
>> +
>> + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
>> + (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
>> + is_ancestor(thread->seccomp.filter,
>> + caller->seccomp.filter)))
>> + continue;
>> +
>> + /* Return the first thread that cannot be synchronized. */
>> + failed = task_pid_vnr(thread);
>> + /* If the pid cannot be resolved, then return -ESRCH */
>> + if (failed == 0)
>> + failed = -ESRCH;
>
> forgot to mention, task_pid_vnr() can't fail. sighand->siglock is held,
> for_each_thread() can't see a thread which has passed unhash_process().

Certainly good to know, but I'd be much more comfortable leaving this
check as-is. Having "failed" return with "0" would be very very bad
(userspace would think the filter had been successfully applied, etc).
I'd rather stay highly defensive here.

-Kees

--
Kees Cook
Chrome OS Security

2014-06-24 18:06:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v7 1/1] man-pages: seccomp.2: document syscall

On Mon, Jun 23, 2014 at 3:01 PM, Kees Cook <[email protected]> wrote:
> Combines documentation from prctl, and in-kernel seccomp_filter.txt,
> along with new details specific to the new syscall.
>
> Signed-off-by: Kees Cook <[email protected]>
> ---
> man2/seccomp.2 | 333 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 333 insertions(+)
> create mode 100644 man2/seccomp.2
>
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> new file mode 100644
> index 0000000..de7fbf7
> --- /dev/null
> +++ b/man2/seccomp.2
> @@ -0,0 +1,333 @@
> +.\" Copyright (C) 2014 Kees Cook <[email protected]>
> +.\" and Copyright (C) 2012 Will Drewry <[email protected]>
> +.\" and Copyright (C) 2008 Michael Kerrisk <[email protected]>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +seccomp \-
> +operate on Secure Computing state of the process
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/seccomp.h>
> +.B #include <linux/filter.h>
> +.B #include <linux/audit.h>
> +.B #include <linux/signal.h>
> +.B #include <sys/ptrace.h>
> +
> +.BI "int seccomp(unsigned int " operation ", unsigned int " flags ,
> +.BI " unsigned char *" args );

At the very least, shouldn't this be void *args?

--Andy

2014-06-24 18:19:31

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 7/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Tue, Jun 24, 2014 at 10:08 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/23, Kees Cook wrote:
>>
>> +static pid_t seccomp_can_sync_threads(void)
>> +{
>> + struct task_struct *thread, *caller;
>> +
>> + BUG_ON(write_can_lock(&tasklist_lock));
>> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
>> +
>> + if (current->seccomp.mode != SECCOMP_MODE_FILTER)
>> + return -EACCES;
>> +
>> + /* Validate all threads being eligible for synchronization. */
>> + thread = caller = current;
>> + for_each_thread(caller, thread) {
>
> You only need to initialize "caller" for for_each_thread(). Same for
> seccomp_sync_threads().

Thanks, I'll fix this up.

>> @@ -586,6 +701,17 @@ static long seccomp_set_mode_filter(unsigned int flags,
>> if (IS_ERR(prepared))
>> return PTR_ERR(prepared);
>>
>> + /*
>> + * If we're doing thread sync, we must hold tasklist_lock
>> + * to make sure seccomp filter changes are stable on threads
>> + * entering or leaving the task list. And we must take it
>> + * before the sighand lock to avoid deadlocking.
>> + */
>> + if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>> + write_lock_irqsave(&tasklist_lock, taskflags);
>> + else
>> + __acquire(&tasklist_lock); /* keep sparse happy */
>> +
>
> Why? ->siglock should be enough, it seems.
>
> It obviously does not protect the global process list, but *sync_threads()
> only care about current's thread group list, no?

I think I was concerned about the exit case, but reading through those
paths again, I can't find a race. Calls to put_seccomp_filter() should
already be safe. Let me see what happens if I drop the tasklist_lock
usage...

-Kees


--
Kees Cook
Chrome OS Security

2014-06-24 18:32:32

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v7 3/9] seccomp: introduce writer locking

I am puzzled by the usage of smp_load_acquire(),

On 06/23, Kees Cook wrote:
>
> static u32 seccomp_run_filters(int syscall)
> {
> - struct seccomp_filter *f;
> + struct seccomp_filter *f = smp_load_acquire(&current->seccomp.filter);
> struct seccomp_data sd;
> u32 ret = SECCOMP_RET_ALLOW;
>
> /* Ensure unexpected behavior doesn't result in failing open. */
> - if (WARN_ON(current->seccomp.filter == NULL))
> + if (WARN_ON(f == NULL))
> return SECCOMP_RET_KILL;
>
> populate_seccomp_data(&sd);
> @@ -186,9 +186,8 @@ static u32 seccomp_run_filters(int syscall)
> * All filters in the list are evaluated and the lowest BPF return
> * value always takes priority (ignoring the DATA).
> */
> - for (f = current->seccomp.filter; f; f = f->prev) {
> + for (; f; f = smp_load_acquire(&f->prev)) {
> u32 cur_ret = SK_RUN_FILTER(f->prog, (void *)&sd);
> -
> if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
> ret = cur_ret;

OK, in this case the 1st one is probably fine, altgough it is not
clear to me why it is better than read_barrier_depends().

But why do we need a 2nd one inside the loop? And if we actually need
it (I don't think so) then why it is safe to use f->prog without
load_acquire ?

> void get_seccomp_filter(struct task_struct *tsk)
> {
> - struct seccomp_filter *orig = tsk->seccomp.filter;
> + struct seccomp_filter *orig = smp_load_acquire(&tsk->seccomp.filter);
> if (!orig)
> return;

This one looks unneeded.

First of all, afaics atomic_inc() should work correctly without any barriers,
otherwise it is buggy. But even this doesn't matter.

With this changes get_seccomp_filter() must be called under ->siglock, it can't
race with add-filter and thus tsk->seccomp.filter should be stable.

> /* Reference count is bounded by the number of total processes. */
> @@ -361,7 +364,7 @@ void put_seccomp_filter(struct task_struct *tsk)
> /* Clean up single-reference branches iteratively. */
> while (orig && atomic_dec_and_test(&orig->usage)) {
> struct seccomp_filter *freeme = orig;
> - orig = orig->prev;
> + orig = smp_load_acquire(&orig->prev);
> seccomp_filter_free(freeme);
> }

This one looks unneeded too. And note that this patch does not add
smp_load_acquire() to read tsk->seccomp.filter.

atomic_dec_and_test() adds mb(), we do not need more barriers to access
->prev ?

Oleg.

2014-06-24 18:37:33

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v7 3/9] seccomp: introduce writer locking

On 06/24, Kees Cook wrote:
>
> On Tue, Jun 24, 2014 at 9:52 AM, Oleg Nesterov <[email protected]> wrote:
> > Kees,
> >
> > I am still trying to force myself to read and try to understand what
> > this series does ;) Just a minor nit so far.
>
> The use-case this solves is when a userspace process does not control
> (or know) when a thread is spawned (e.g. via shared library init, or
> LD_PRELOAD) but wants to make sure seccomp filters have been applied
> to it.

Yes, thanks, I understand this. But the details are not clear to me so
far, I'll try to re-read this series later.

> >> @@ -1142,6 +1168,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >> {
> >> int retval;
> >> struct task_struct *p;
> >> + unsigned long irqflags;
> >>
> >> if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> >> return ERR_PTR(-EINVAL);
> >> @@ -1196,7 +1223,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >> goto fork_out;
> >>
> >> ftrace_graph_init_task(p);
> >> - get_seccomp_filter(p);
> >>
> >> rt_mutex_init_task(p);
> >>
> >> @@ -1434,7 +1460,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >> p->parent_exec_id = current->self_exec_id;
> >> }
> >>
> >> - spin_lock(&current->sighand->siglock);
> >> + spin_lock_irqsave(&current->sighand->siglock, irqflags);
> >> +
> >> + /*
> >> + * Copy seccomp details explicitly here, in case they were changed
> >> + * before holding tasklist_lock.
> >> + */
> >> + copy_seccomp(p);
> >>
> >> /*
> >> * Process group and session signals need to be delivered to just the
> >> @@ -1446,7 +1478,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >> */
> >> recalc_sigpending();
> >> if (signal_pending(current)) {
> >> - spin_unlock(&current->sighand->siglock);
> >> + spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
> >> write_unlock_irq(&tasklist_lock);
> >> retval = -ERESTARTNOINTR;
> >> goto bad_fork_free_pid;
> >> @@ -1486,7 +1518,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >> }
> >>
> >> total_forks++;
> >> - spin_unlock(&current->sighand->siglock);
> >> + spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
> >> write_unlock_irq(&tasklist_lock);
> >> proc_fork_connector(p);
> >> cgroup_post_fork(p);
> >
> > It seems that the only change copy_process() needs is copy_seccomp() under the locks.
> > Everythinh else (irqflags games) looks obviously unneeded?
>
> I got irq lock dep warnings without these changes.

With or without your patches? Could you show the waring?

> If they're
> unneeded, that's totally fine by me, but some change (either this or
> markings to keep lockdep happy) is needed.

Yes, we need to understand what what happens...

Oleg.

2014-06-24 18:39:42

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v7 7/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On 06/24, Kees Cook wrote:
>
> On Tue, Jun 24, 2014 at 10:27 AM, Oleg Nesterov <[email protected]> wrote:
> > On 06/23, Kees Cook wrote:
> >>
> >> +static pid_t seccomp_can_sync_threads(void)
> >> +{
> >> + struct task_struct *thread, *caller;
> >> +
> >> + BUG_ON(write_can_lock(&tasklist_lock));
> >> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
> >> +
> >> + if (current->seccomp.mode != SECCOMP_MODE_FILTER)
> >> + return -EACCES;
> >> +
> >> + /* Validate all threads being eligible for synchronization. */
> >> + thread = caller = current;
> >> + for_each_thread(caller, thread) {
> >> + pid_t failed;
> >> +
> >> + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
> >> + (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
> >> + is_ancestor(thread->seccomp.filter,
> >> + caller->seccomp.filter)))
> >> + continue;
> >> +
> >> + /* Return the first thread that cannot be synchronized. */
> >> + failed = task_pid_vnr(thread);
> >> + /* If the pid cannot be resolved, then return -ESRCH */
> >> + if (failed == 0)
> >> + failed = -ESRCH;
> >
> > forgot to mention, task_pid_vnr() can't fail. sighand->siglock is held,
> > for_each_thread() can't see a thread which has passed unhash_process().
>
> Certainly good to know, but I'd be much more comfortable leaving this
> check as-is. Having "failed" return with "0" would be very very bad
> (userspace would think the filter had been successfully applied, etc).
> I'd rather stay highly defensive here.

OK, agreed. Although in this case I'd suggest

if (WARN_ON(failed == 0))
failed = -ESRCH;

but I won't insist.

Oleg.

2014-06-24 19:08:22

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 7/9] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

On Tue, Jun 24, 2014 at 11:37 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/24, Kees Cook wrote:
>>
>> On Tue, Jun 24, 2014 at 10:27 AM, Oleg Nesterov <[email protected]> wrote:
>> > On 06/23, Kees Cook wrote:
>> >>
>> >> +static pid_t seccomp_can_sync_threads(void)
>> >> +{
>> >> + struct task_struct *thread, *caller;
>> >> +
>> >> + BUG_ON(write_can_lock(&tasklist_lock));
>> >> + BUG_ON(!spin_is_locked(&current->sighand->siglock));
>> >> +
>> >> + if (current->seccomp.mode != SECCOMP_MODE_FILTER)
>> >> + return -EACCES;
>> >> +
>> >> + /* Validate all threads being eligible for synchronization. */
>> >> + thread = caller = current;
>> >> + for_each_thread(caller, thread) {
>> >> + pid_t failed;
>> >> +
>> >> + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
>> >> + (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
>> >> + is_ancestor(thread->seccomp.filter,
>> >> + caller->seccomp.filter)))
>> >> + continue;
>> >> +
>> >> + /* Return the first thread that cannot be synchronized. */
>> >> + failed = task_pid_vnr(thread);
>> >> + /* If the pid cannot be resolved, then return -ESRCH */
>> >> + if (failed == 0)
>> >> + failed = -ESRCH;
>> >
>> > forgot to mention, task_pid_vnr() can't fail. sighand->siglock is held,
>> > for_each_thread() can't see a thread which has passed unhash_process().
>>
>> Certainly good to know, but I'd be much more comfortable leaving this
>> check as-is. Having "failed" return with "0" would be very very bad
>> (userspace would think the filter had been successfully applied, etc).
>> I'd rather stay highly defensive here.
>
> OK, agreed. Although in this case I'd suggest
>
> if (WARN_ON(failed == 0))
> failed = -ESRCH;
>
> but I won't insist.

Excellent idea, yes! I'll include this as well.

-Kees

--
Kees Cook
Chrome OS Security

2014-06-24 19:18:29

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 1/1] man-pages: seccomp.2: document syscall

On Tue, Jun 24, 2014 at 11:06 AM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Jun 23, 2014 at 3:01 PM, Kees Cook <[email protected]> wrote:
>> Combines documentation from prctl, and in-kernel seccomp_filter.txt,
>> along with new details specific to the new syscall.
>>
>> Signed-off-by: Kees Cook <[email protected]>
>> ---
>> man2/seccomp.2 | 333 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 333 insertions(+)
>> create mode 100644 man2/seccomp.2
>>
>> diff --git a/man2/seccomp.2 b/man2/seccomp.2
>> new file mode 100644
>> index 0000000..de7fbf7
>> --- /dev/null
>> +++ b/man2/seccomp.2
>> @@ -0,0 +1,333 @@
>> +.\" Copyright (C) 2014 Kees Cook <[email protected]>
>> +.\" and Copyright (C) 2012 Will Drewry <[email protected]>
>> +.\" and Copyright (C) 2008 Michael Kerrisk <[email protected]>
>> +.\"
>> +.\" %%%LICENSE_START(VERBATIM)
>> +.\" Permission is granted to make and distribute verbatim copies of this
>> +.\" manual provided the copyright notice and this permission notice are
>> +.\" preserved on all copies.
>> +.\"
>> +.\" Permission is granted to copy and distribute modified versions of this
>> +.\" manual under the conditions for verbatim copying, provided that the
>> +.\" entire resulting derived work is distributed under the terms of a
>> +.\" permission notice identical to this one.
>> +.\"
>> +.\" Since the Linux kernel and libraries are constantly changing, this
>> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
>> +.\" responsibility for errors or omissions, or for damages resulting from
>> +.\" the use of the information contained herein. The author(s) may not
>> +.\" have taken the same level of care in the production of this manual,
>> +.\" which is licensed free of charge, as they might when working
>> +.\" professionally.
>> +.\"
>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>> +.\" the source, must acknowledge the copyright and authors of this work.
>> +.\" %%%LICENSE_END
>> +.\"
>> +.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +seccomp \-
>> +operate on Secure Computing state of the process
>> +.SH SYNOPSIS
>> +.nf
>> +.B #include <linux/seccomp.h>
>> +.B #include <linux/filter.h>
>> +.B #include <linux/audit.h>
>> +.B #include <linux/signal.h>
>> +.B #include <sys/ptrace.h>
>> +
>> +.BI "int seccomp(unsigned int " operation ", unsigned int " flags ,
>> +.BI " unsigned char *" args );
>
> At the very least, shouldn't this be void *args?

Yeah, good point. Fixed for the next version...

-Kees

--
Kees Cook
Chrome OS Security

2014-06-24 19:20:09

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v7 4/9] seccomp: move no_new_privs into seccomp

On 06/23, Kees Cook wrote:
>
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -3,6 +3,8 @@
>
> #include <uapi/linux/seccomp.h>
>
> +#define SECCOMP_FLAG_NO_NEW_PRIVS 0 /* task may not gain privs */
> +
> #ifdef CONFIG_SECCOMP
>
> #include <linux/thread_info.h>
> @@ -16,6 +18,7 @@ struct seccomp_filter;
> * system calls available to a process.
> * @filter: must always point to a valid seccomp-filter or NULL as it is
> * accessed without locking during system call entry.
> + * @flags: flags under task->sighand->siglock lock
> *
> * @filter must only be accessed from the context of current as there
> * is no read locking.
> @@ -23,6 +26,7 @@ struct seccomp_filter;
> struct seccomp {
> int mode;
> struct seccomp_filter *filter;
> + unsigned long flags;
> };
>
> extern int __secure_computing(int);
> @@ -51,7 +55,9 @@ static inline int seccomp_mode(struct seccomp *s)
>
> #include <linux/errno.h>
>
> -struct seccomp { };
> +struct seccomp {
> + unsigned long flags;
> +};

A bit messy ;)

I am wondering if we can simply do

static inline bool current_no_new_privs(void)
{
if (current->no_new_privs)
return true;

#ifdef CONFIG_SECCOMP
if (test_thread_flag(TIF_SECCOMP))
return true;
#endif

return false;

return test_bit(SECCOMP_FLAG_NO_NEW_PRIVS, &p->seccomp.flags);
}

instead ?

Oleg.

2014-06-24 19:21:23

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v7 4/9] seccomp: move no_new_privs into seccomp

On Tue, Jun 24, 2014 at 12:18 PM, Oleg Nesterov <[email protected]> wrote:
> On 06/23, Kees Cook wrote:
>>
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -3,6 +3,8 @@
>>
>> #include <uapi/linux/seccomp.h>
>>
>> +#define SECCOMP_FLAG_NO_NEW_PRIVS 0 /* task may not gain privs */
>> +
>> #ifdef CONFIG_SECCOMP
>>
>> #include <linux/thread_info.h>
>> @@ -16,6 +18,7 @@ struct seccomp_filter;
>> * system calls available to a process.
>> * @filter: must always point to a valid seccomp-filter or NULL as it is
>> * accessed without locking during system call entry.
>> + * @flags: flags under task->sighand->siglock lock
>> *
>> * @filter must only be accessed from the context of current as there
>> * is no read locking.
>> @@ -23,6 +26,7 @@ struct seccomp_filter;
>> struct seccomp {
>> int mode;
>> struct seccomp_filter *filter;
>> + unsigned long flags;
>> };
>>
>> extern int __secure_computing(int);
>> @@ -51,7 +55,9 @@ static inline int seccomp_mode(struct seccomp *s)
>>
>> #include <linux/errno.h>
>>
>> -struct seccomp { };
>> +struct seccomp {
>> + unsigned long flags;
>> +};
>
> A bit messy ;)
>
> I am wondering if we can simply do
>
> static inline bool current_no_new_privs(void)
> {
> if (current->no_new_privs)
> return true;
>
> #ifdef CONFIG_SECCOMP
> if (test_thread_flag(TIF_SECCOMP))
> return true;
> #endif

Nope -- privileged users can enable seccomp w/o nnp.

--Andy

2014-06-24 19:32:49

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v7 4/9] seccomp: move no_new_privs into seccomp

On 06/24, Andy Lutomirski wrote:
>
> On Tue, Jun 24, 2014 at 12:18 PM, Oleg Nesterov <[email protected]> wrote:
> >>
> >> -struct seccomp { };
> >> +struct seccomp {
> >> + unsigned long flags;
> >> +};
> >
> > A bit messy ;)
> >
> > I am wondering if we can simply do
> >
> > static inline bool current_no_new_privs(void)
> > {
> > if (current->no_new_privs)
> > return true;
> >
> > #ifdef CONFIG_SECCOMP
> > if (test_thread_flag(TIF_SECCOMP))
> > return true;
> > #endif
>
> Nope -- privileged users can enable seccomp w/o nnp.

Indeed, I am stupid.

Still it would be nice to cleanup this somehow. The new member is only
used as a previous ->no_new_privs, just it is long to allow the concurent
set/get. Logically it doesn't even belong to seccomp{}.

Oleg.

2014-06-24 19:35:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v7 4/9] seccomp: move no_new_privs into seccomp

On Tue, Jun 24, 2014 at 12:30 PM, Oleg Nesterov <[email protected]> wrote:
> On 06/24, Andy Lutomirski wrote:
>>
>> On Tue, Jun 24, 2014 at 12:18 PM, Oleg Nesterov <[email protected]> wrote:
>> >>
>> >> -struct seccomp { };
>> >> +struct seccomp {
>> >> + unsigned long flags;
>> >> +};
>> >
>> > A bit messy ;)
>> >
>> > I am wondering if we can simply do
>> >
>> > static inline bool current_no_new_privs(void)
>> > {
>> > if (current->no_new_privs)
>> > return true;
>> >
>> > #ifdef CONFIG_SECCOMP
>> > if (test_thread_flag(TIF_SECCOMP))
>> > return true;
>> > #endif
>>
>> Nope -- privileged users can enable seccomp w/o nnp.
>
> Indeed, I am stupid.
>
> Still it would be nice to cleanup this somehow. The new member is only
> used as a previous ->no_new_privs, just it is long to allow the concurent
> set/get. Logically it doesn't even belong to seccomp{}.

We could add an unsigned long atomic flags field to task_struct.

Grr. Why isn't there an unsigned *int* atomic bitmask type? Even u64
would be better. unsigned long is useless.

>
> Oleg.
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-06-24 19:46:06

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 3/9] seccomp: introduce writer locking

On Tue, Jun 24, 2014 at 11:30 AM, Oleg Nesterov <[email protected]> wrote:
> I am puzzled by the usage of smp_load_acquire(),

It was recommended by Andy Lutomirski in preference to ACCESS_ONCE().

> On 06/23, Kees Cook wrote:
>>
>> static u32 seccomp_run_filters(int syscall)
>> {
>> - struct seccomp_filter *f;
>> + struct seccomp_filter *f = smp_load_acquire(&current->seccomp.filter);
>> struct seccomp_data sd;
>> u32 ret = SECCOMP_RET_ALLOW;
>>
>> /* Ensure unexpected behavior doesn't result in failing open. */
>> - if (WARN_ON(current->seccomp.filter == NULL))
>> + if (WARN_ON(f == NULL))
>> return SECCOMP_RET_KILL;
>>
>> populate_seccomp_data(&sd);
>> @@ -186,9 +186,8 @@ static u32 seccomp_run_filters(int syscall)
>> * All filters in the list are evaluated and the lowest BPF return
>> * value always takes priority (ignoring the DATA).
>> */
>> - for (f = current->seccomp.filter; f; f = f->prev) {
>> + for (; f; f = smp_load_acquire(&f->prev)) {
>> u32 cur_ret = SK_RUN_FILTER(f->prog, (void *)&sd);
>> -
>> if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
>> ret = cur_ret;
>
> OK, in this case the 1st one is probably fine, altgough it is not
> clear to me why it is better than read_barrier_depends().
>
> But why do we need a 2nd one inside the loop? And if we actually need
> it (I don't think so) then why it is safe to use f->prog without
> load_acquire ?

You're right -- it should not be possible for for any of the ->prev
pointers to change.

>> void get_seccomp_filter(struct task_struct *tsk)
>> {
>> - struct seccomp_filter *orig = tsk->seccomp.filter;
>> + struct seccomp_filter *orig = smp_load_acquire(&tsk->seccomp.filter);
>> if (!orig)
>> return;
>
> This one looks unneeded.
>
> First of all, afaics atomic_inc() should work correctly without any barriers,
> otherwise it is buggy. But even this doesn't matter.
>
> With this changes get_seccomp_filter() must be called under ->siglock, it can't
> race with add-filter and thus tsk->seccomp.filter should be stable.

Excellent point, yes. I'll remove that.

>> /* Reference count is bounded by the number of total processes. */
>> @@ -361,7 +364,7 @@ void put_seccomp_filter(struct task_struct *tsk)
>> /* Clean up single-reference branches iteratively. */
>> while (orig && atomic_dec_and_test(&orig->usage)) {
>> struct seccomp_filter *freeme = orig;
>> - orig = orig->prev;
>> + orig = smp_load_acquire(&orig->prev);
>> seccomp_filter_free(freeme);
>> }
>
> This one looks unneeded too. And note that this patch does not add
> smp_load_acquire() to read tsk->seccomp.filter.

Hrm, yes, that should get added.

> atomic_dec_and_test() adds mb(), we do not need more barriers to access
> ->prev ?

Right, same situation as the run_filters loop. Thanks!

-Kees

--
Kees Cook
Chrome OS Security

2014-06-24 19:50:08

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 4/9] seccomp: move no_new_privs into seccomp

On Tue, Jun 24, 2014 at 12:34 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Jun 24, 2014 at 12:30 PM, Oleg Nesterov <[email protected]> wrote:
>> On 06/24, Andy Lutomirski wrote:
>>>
>>> On Tue, Jun 24, 2014 at 12:18 PM, Oleg Nesterov <[email protected]> wrote:
>>> >>
>>> >> -struct seccomp { };
>>> >> +struct seccomp {
>>> >> + unsigned long flags;
>>> >> +};
>>> >
>>> > A bit messy ;)
>>> >
>>> > I am wondering if we can simply do
>>> >
>>> > static inline bool current_no_new_privs(void)
>>> > {
>>> > if (current->no_new_privs)
>>> > return true;
>>> >
>>> > #ifdef CONFIG_SECCOMP
>>> > if (test_thread_flag(TIF_SECCOMP))
>>> > return true;
>>> > #endif
>>>
>>> Nope -- privileged users can enable seccomp w/o nnp.
>>
>> Indeed, I am stupid.
>>
>> Still it would be nice to cleanup this somehow. The new member is only
>> used as a previous ->no_new_privs, just it is long to allow the concurent
>> set/get. Logically it doesn't even belong to seccomp{}.
>
> We could add an unsigned long atomic flags field to task_struct.

I thought that had gotten shot down originally, but given the current
state of the patch series, it would be effectively identical, since my
earlier attempt at keeping sizes the same (with alternate accessors)
was too messy. I will change this as well.

> Grr. Why isn't there an unsigned *int* atomic bitmask type? Even u64
> would be better. unsigned long is useless.

Useless beyond 32 bits. ;)

-Kees

--
Kees Cook
Chrome OS Security

2014-06-24 19:52:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v7 4/9] seccomp: move no_new_privs into seccomp

On Tue, Jun 24, 2014 at 12:50 PM, Kees Cook <[email protected]> wrote:
> On Tue, Jun 24, 2014 at 12:34 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Jun 24, 2014 at 12:30 PM, Oleg Nesterov <[email protected]> wrote:
>>> On 06/24, Andy Lutomirski wrote:
>>>>
>>>> On Tue, Jun 24, 2014 at 12:18 PM, Oleg Nesterov <[email protected]> wrote:
>>>> >>
>>>> >> -struct seccomp { };
>>>> >> +struct seccomp {
>>>> >> + unsigned long flags;
>>>> >> +};
>>>> >
>>>> > A bit messy ;)
>>>> >
>>>> > I am wondering if we can simply do
>>>> >
>>>> > static inline bool current_no_new_privs(void)
>>>> > {
>>>> > if (current->no_new_privs)
>>>> > return true;
>>>> >
>>>> > #ifdef CONFIG_SECCOMP
>>>> > if (test_thread_flag(TIF_SECCOMP))
>>>> > return true;
>>>> > #endif
>>>>
>>>> Nope -- privileged users can enable seccomp w/o nnp.
>>>
>>> Indeed, I am stupid.
>>>
>>> Still it would be nice to cleanup this somehow. The new member is only
>>> used as a previous ->no_new_privs, just it is long to allow the concurent
>>> set/get. Logically it doesn't even belong to seccomp{}.
>>
>> We could add an unsigned long atomic flags field to task_struct.
>
> I thought that had gotten shot down originally, but given the current
> state of the patch series, it would be effectively identical, since my
> earlier attempt at keeping sizes the same (with alternate accessors)
> was too messy. I will change this as well.
>
>> Grr. Why isn't there an unsigned *int* atomic bitmask type? Even u64
>> would be better. unsigned long is useless.
>
> Useless beyond 32 bits. ;)

It basically guarantees 32 wasted bits on 64-bit systems.

I guess that unsigned long foo[64/BITS_PER_LONG] would work, bit that's hideous.

--Andy

2014-06-24 20:27:03

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 3/9] seccomp: introduce writer locking

On Tue, Jun 24, 2014 at 11:35 AM, Oleg Nesterov <[email protected]> wrote:
> On 06/24, Kees Cook wrote:
>> On Tue, Jun 24, 2014 at 9:52 AM, Oleg Nesterov <[email protected]> wrote:
>> >> @@ -1142,6 +1168,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> >> {
>> >> int retval;
>> >> struct task_struct *p;
>> >> + unsigned long irqflags;
>> >>
>> >> if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
>> >> return ERR_PTR(-EINVAL);
>> >> @@ -1196,7 +1223,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> >> goto fork_out;
>> >>
>> >> ftrace_graph_init_task(p);
>> >> - get_seccomp_filter(p);
>> >>
>> >> rt_mutex_init_task(p);
>> >>
>> >> @@ -1434,7 +1460,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> >> p->parent_exec_id = current->self_exec_id;
>> >> }
>> >>
>> >> - spin_lock(&current->sighand->siglock);
>> >> + spin_lock_irqsave(&current->sighand->siglock, irqflags);
>> >> +
>> >> + /*
>> >> + * Copy seccomp details explicitly here, in case they were changed
>> >> + * before holding tasklist_lock.
>> >> + */
>> >> + copy_seccomp(p);
>> >>
>> >> /*
>> >> * Process group and session signals need to be delivered to just the
>> >> @@ -1446,7 +1478,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> >> */
>> >> recalc_sigpending();
>> >> if (signal_pending(current)) {
>> >> - spin_unlock(&current->sighand->siglock);
>> >> + spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
>> >> write_unlock_irq(&tasklist_lock);
>> >> retval = -ERESTARTNOINTR;
>> >> goto bad_fork_free_pid;
>> >> @@ -1486,7 +1518,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> >> }
>> >>
>> >> total_forks++;
>> >> - spin_unlock(&current->sighand->siglock);
>> >> + spin_unlock_irqrestore(&current->sighand->siglock, irqflags);
>> >> write_unlock_irq(&tasklist_lock);
>> >> proc_fork_connector(p);
>> >> cgroup_post_fork(p);
>> >
>> > It seems that the only change copy_process() needs is copy_seccomp() under the locks.
>> > Everythinh else (irqflags games) looks obviously unneeded?
>>
>> I got irq lock dep warnings without these changes.
>
> With or without your patches? Could you show the waring?

It seems it's only needed in seccomp itself (I can drop the changes in
kernel/fork.c). I get no warnings in that case. If I also remove irq
handling from seccomp, I see:

[ 17.444328]
[ 17.445031] =================================
[ 17.445031] [ INFO: inconsistent lock state ]
[ 17.445031] 3.16.0-rc2+ #289 Not tainted
[ 17.445031] ---------------------------------
[ 17.445031] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 17.445031] seccomp_bpf_tes/1987 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 17.445031] (&(&sighand->siglock)->rlock){?.....}, at:
[<ffffffff9e2fb7e5>] do_seccomp.part.7+0x25/0xc0
[ 17.445031] {IN-HARDIRQ-W} state was registered at:
[ 17.445031] [<ffffffff9e2a422a>] mark_irqflags+0x19a/0x1b0
[ 17.445031] [<ffffffff9e2a4542>] __lock_acquire+0x302/0x9e0
[ 17.445031] [<ffffffff9e2a5325>] lock_acquire+0x95/0x1e0
[ 17.445031] [<ffffffff9ebda204>] _raw_spin_lock+0x34/0x50
[ 17.445031] [<ffffffff9e263e70>] __lock_task_sighand+0xa0/0x230
[ 17.445031] [<ffffffff9e2652cf>] send_sigqueue+0x3f/0x280
[ 17.445031] [<ffffffff9e2781e3>] posix_timer_event+0x83/0x140
[ 17.445031] [<ffffffff9e2782f2>] posix_timer_fn+0x52/0xd0
[ 17.445031] [<ffffffff9e27d3bc>] __run_hrtimer+0x7c/0x420
[ 17.445031] [<ffffffff9e27de27>] hrtimer_interrupt+0x107/0x260
[ 17.445031] [<ffffffff9e235aa6>] local_apic_timer_interrupt+0x36/0x60
[ 17.445031] [<ffffffff9e235f1e>] smp_apic_timer_interrupt+0x3e/0x60
[ 17.445031] [<ffffffff9ebdbf2f>] apic_timer_interrupt+0x6f/0x80
[ 17.445031] [<ffffffff9e20d99a>] arch_cpu_idle+0xa/0x10
[ 17.445031] [<ffffffff9e29d587>] cpuidle_idle_call+0x157/0x3d0
[ 17.445031] [<ffffffff9e29d945>] cpu_idle_loop+0x145/0x370
[ 17.445031] [<ffffffff9e29dbc6>] cpu_startup_entry+0x56/0x60
[ 17.445031] [<ffffffff9e234544>] start_secondary+0xd4/0xe0
[ 17.445031] irq event stamp: 243
[ 17.445031] hardirqs last enabled at (243): [<ffffffff9e2b9d00>]
__call_rcu.constprop.63+0x70/0x120
[ 17.445031] hardirqs last disabled at (242): [<ffffffff9e2b9cd2>]
__call_rcu.constprop.63+0x42/0x120
[ 17.445031] softirqs last enabled at (50): [<ffffffff9e256310>]
__do_softirq+0x1d0/0x4d0
[ 17.445031] softirqs last disabled at (21): [<ffffffff9e2568be>]
irq_exit+0x8e/0xb0
[ 17.445031]
[ 17.445031] other info that might help us debug this:
[ 17.445031] Possible unsafe locking scenario:
[ 17.445031]
[ 17.445031] CPU0
[ 17.445031] ----
[ 17.445031] lock(&(&sighand->siglock)->rlock);
[ 17.445031] <Interrupt>
[ 17.445031] lock(&(&sighand->siglock)->rlock);
[ 17.445031]
[ 17.445031] *** DEADLOCK ***
[ 17.445031]
[ 17.445031] no locks held by seccomp_bpf_tes/1987.
[ 17.445031]
[ 17.445031] stack backtrace:
[ 17.445031] CPU: 0 PID: 1987 Comm: seccomp_bpf_tes Not tainted
3.16.0-rc2+ #289
[ 17.445031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 17.445031] ffffffff9f71dd90 ffff88007878bc48 ffffffff9ebc4fb4
0000000000000007
[ 17.445031] ffff880077fb3570 ffff88007878bca8 ffffffff9ebbad0d
0000000000000000
[ 17.445031] 0000000000000001 00007fff00000001 ffff88007878bd70
ffff88007878bd18
[ 17.445031] Call Trace:
[ 17.445031] [<ffffffff9ebc4fb4>] dump_stack+0x4e/0x68
[ 17.445031] [<ffffffff9ebbad0d>] print_usage_bug+0x1f1/0x202
[ 17.445031] [<ffffffff9e2a26e0>] ? check_usage_forwards+0x150/0x150
[ 17.445031] [<ffffffff9ebbad8a>] mark_lock_irq+0x6c/0x137
[ 17.445031] [<ffffffff9e2a3ff5>] mark_lock+0x125/0x1c0
[ 17.445031] [<ffffffff9e2a41c8>] mark_irqflags+0x138/0x1b0
[ 17.445031] [<ffffffff9e2a4542>] __lock_acquire+0x302/0x9e0
[ 17.445031] [<ffffffff9e383a0c>] ? create_object+0x21c/0x2d0
[ 17.445031] [<ffffffff9e2a5325>] lock_acquire+0x95/0x1e0
[ 17.445031] [<ffffffff9e2fb7e5>] ? do_seccomp.part.7+0x25/0xc0
[ 17.445031] [<ffffffff9e2a5e55>] ? trace_hardirqs_on_caller+0x105/0x1d0
[ 17.445031] [<ffffffff9ebda204>] _raw_spin_lock+0x34/0x50
[ 17.445031] [<ffffffff9e2fb7e5>] ? do_seccomp.part.7+0x25/0xc0
[ 17.445031] [<ffffffff9e27ffc5>] ? abort_creds+0x45/0x50
[ 17.445031] [<ffffffff9e2fb7e5>] do_seccomp.part.7+0x25/0xc0
[ 17.445031] [<ffffffff9e2fbf28>] do_seccomp+0x18/0x40
[ 17.445031] [<ffffffff9e2fc1df>] prctl_set_seccomp+0x2f/0x40
[ 17.445031] [<ffffffff9e26bce1>] SyS_prctl+0x141/0x4b0
[ 17.445031] [<ffffffff9e2f306c>] ? __audit_syscall_entry+0x8c/0xe0
[ 17.445031] [<ffffffff9e59556e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 17.445031] [<ffffffff9ebdb012>] system_call_fastpath+0x16/0x1b


I'll drop the fork.c changes, and keep the seccomp.c irqflags.

Thanks!

-Kees

--
Kees Cook
Chrome OS Security

2014-06-26 12:37:17

by David Drysdale

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] seccomp: split filter prep from check and apply

On Mon, Jun 23, 2014 at 02:58:06PM -0700, Kees Cook wrote:
> In preparation for adding seccomp locking, move filter creation away
> from where it is checked and applied. This will allow for locking where
> no memory allocation is happening. The validation, filter attachment,
> and seccomp mode setting can all happen under the future locks.
>
> Signed-off-by: Kees Cook <[email protected]>
> ---
> kernel/seccomp.c | 97 +++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 67 insertions(+), 30 deletions(-)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index afb916c7e890..edc8c79ed16d 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -515,6 +551,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
> current->seccomp.mode = seccomp_mode;
> set_thread_flag(TIF_SECCOMP);
> out:
> + seccomp_filter_free(prepared);
> return ret;
> }

I think this needs to be inside #ifdef CONFIG_SECCOMP_FILTER to match
the definition of seccomp_filter_free:

../kernel/seccomp.c:554:2: error: implicit declaration of function ‘seccomp_filter_free’ [-Werror=implicit-function-declaration]

2014-06-27 18:45:32

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] seccomp: split filter prep from check and apply

On Thu, Jun 26, 2014 at 5:37 AM, David Drysdale <[email protected]> wrote:
> On Mon, Jun 23, 2014 at 02:58:06PM -0700, Kees Cook wrote:
>> In preparation for adding seccomp locking, move filter creation away
>> from where it is checked and applied. This will allow for locking where
>> no memory allocation is happening. The validation, filter attachment,
>> and seccomp mode setting can all happen under the future locks.
>>
>> Signed-off-by: Kees Cook <[email protected]>
>> ---
>> kernel/seccomp.c | 97 +++++++++++++++++++++++++++++++++++++-----------------
>> 1 file changed, 67 insertions(+), 30 deletions(-)
>>
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index afb916c7e890..edc8c79ed16d 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -515,6 +551,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
>> current->seccomp.mode = seccomp_mode;
>> set_thread_flag(TIF_SECCOMP);
>> out:
>> + seccomp_filter_free(prepared);
>> return ret;
>> }
>
> I think this needs to be inside #ifdef CONFIG_SECCOMP_FILTER to match
> the definition of seccomp_filter_free:
>
> ../kernel/seccomp.c:554:2: error: implicit declaration of function ‘seccomp_filter_free’ [-Werror=implicit-function-declaration]

Thanks for catching that! I've ended up rearranging the patch series
so the prepare/attach split happens after I've split the set_mode
functions now, so I've managed to avoid this condition now. :)

-Kees

--
Kees Cook
Chrome OS Security