2005-01-21 10:08:03

by Andrea Arcangeli

Subject: seccomp for 2.6.11-rc1-bk8

Hello,

This is the seccomp patch ported to 2.6.11-rc1-bk8, which I need for
Cpushare (until trusted computing hits the hardware market). Progress is
on schedule so far, so it might not be a bad idea to merge this into the
kernel sooner rather than later, so that there will be a significant
userbase capable of running the Cpushare client as soon as it becomes
available (plus I won't have to forward port the patch all the time ;).
Getting this merged any time before the end of 2005 is going to be
fundamental for my project if my forecasts turn out to be correct (which
is not guaranteed, but if I'm wrong that could also mean I need it
sooner ;). There is no short-term urgency, so even 2.6.12/13 will be ok,
but if you can merge it now it's even better and it'll certainly save me
some time.

I remember you asked for syscalls. I can add them, but I would still
like to be able to get/set the value from the /proc API. I don't really
feel the need for syscalls: this is anything but a fast path, and the
overhead of creating the pipes and forking would be significant too.
Ideally I could add syscalls to make it easier to use in a chroot
environment (just in case someone feels the need to stack seccomp on top
of chroot); that's the only reason why syscalls might ever be useful.
But this is still nice to have in /proc, at least in read-only mode, so
I see the current patch as a good starting point and as valid code for
the long term (not overlapping with syscalls, since `cat/echo` cannot be
used with the syscalls).
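
[Editor's sketch, not part of the original mail.] The /proc interface
described above amounts to writing a decimal mode number into the seccomp
file. A minimal userspace sketch of that step follows, assuming a kernel
carrying this patch; the helper name request_seccomp_mode is hypothetical.
The helper takes an already-open descriptor and does not close it: mode 1
in the patch allows only read/write/exit/sigreturn, so a task that has just
restricted itself would presumably be killed by a later close().

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): write a seccomp mode to an
 * already-open descriptor, for example one opened on /proc/self/seccomp
 * or /proc/<pid>/seccomp on a kernel carrying this patch.
 * Returns 0 on success, -1 on failure.
 */
int request_seccomp_mode(int fd, unsigned int mode)
{
    char buf[20];
    int len = snprintf(buf, sizeof(buf), "%u\n", mode);

    if (len < 0 || (size_t)len >= sizeof(buf))
        return -1;

    /*
     * A single write() carries the whole value. The descriptor is
     * deliberately left open; see the note about close() above.
     */
    return write(fd, buf, (size_t)len) == (ssize_t)len ? 0 : -1;
}
```

Because the helper only needs a descriptor, it can be tried against a pipe
or a scratch file on a kernel without the patch.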

As usual this is theoretically useful to run any kind of untrusted
bytecode on the computer. This also means code that might have bugs,
like decompressing an mpeg stream securely regardless of the decoder
lib, or stuff like that. I've no idea if somebody is going to use it for
that though. I only know I'm going to use it with Cpushare 8).

Works for me:

andrea@dualathlon:~/cpushare/client/cpushare> python seccomp_test.py
gcc -march=i686 -Os -Wall -fomit-frame-pointer -fno-common seccomp-loader.c -o seccomp-loader
gcc -c -march=i686 -Os -Wall -fomit-frame-pointer -fno-common bytecode.c -o bytecode.o
cpp bytecode.lds.S -o bytecode.lds.s
grep -A100000000 SECTION bytecode.lds.s > bytecode.lds
ld -T bytecode.lds bytecode.o /usr/lib/gcc-lib/i586-suse-linux/3.3.4/libgcc.a /usr/lib/libc.a /usr/lib/libm.a -N -o bytecode
objcopy -O binary bytecode -j .text bytecode.text.bin
objcopy -O binary bytecode -j .data bytecode.data.bin
gcc -c -march=i686 -Os -Wall -fomit-frame-pointer -fno-common -DMALICIOUS bytecode.c -o bytecode-malicious.o
ld -T bytecode.lds bytecode-malicious.o /usr/lib/gcc-lib/i586-suse-linux/3.3.4/libgcc.a /usr/lib/libc.a /usr/lib/libm.a -N -o bytecode-malicious
objcopy -O binary bytecode-malicious -j .text bytecode-malicious.text.bin
objcopy -O binary bytecode-malicious -j .data bytecode-malicious.data.bin
Starting computing some malicious bytecode
init
load
start
stop
receive_data failure
kill
exit_code 0 signal 9
The malicious bytecode has been killed successfully by seccomp
Starting computing some safe bytecode
init
load
start
stop
1509 counts
kill
exit_code 0 signal 0
The seccomp_test.py completed successfully, thank you for testing.
andrea@dualathlon:~/cpushare/client/cpushare>

Thanks.

--- xxx/arch/i386/Kconfig 2005-01-21 09:14:54.000000000 +0100
+++ xx/arch/i386/Kconfig 2005-01-21 09:07:57.000000000 +0100
@@ -33,6 +33,10 @@ config GENERIC_IOMAP
bool
default y

+config SECCOMP
+ bool
+ default y
+
source "init/Kconfig"

menu "Processor type and features"
--- xxx/arch/i386/kernel/entry.S 2005-01-15 20:44:49.000000000 +0100
+++ xx/arch/i386/kernel/entry.S 2005-01-21 09:07:57.000000000 +0100
@@ -221,7 +221,8 @@ sysenter_past_esp:
SAVE_ALL
GET_THREAD_INFO(%ebp)

- testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
+ /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
+ testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp)
jnz syscall_trace_entry
cmpl $(nr_syscalls), %eax
jae syscall_badsys
@@ -245,7 +246,8 @@ ENTRY(system_call)
SAVE_ALL
GET_THREAD_INFO(%ebp)
# system call tracing in operation
- testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
+ /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
+ testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp)
jnz syscall_trace_entry
cmpl $(nr_syscalls), %eax
jae syscall_badsys
--- xxx/arch/i386/kernel/ptrace.c 2005-01-15 20:44:49.000000000 +0100
+++ xx/arch/i386/kernel/ptrace.c 2005-01-21 09:07:57.000000000 +0100
@@ -15,6 +15,7 @@
#include <linux/user.h>
#include <linux/security.h>
#include <linux/audit.h>
+#include <linux/seccomp.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -678,6 +679,10 @@ void send_sigtrap(struct task_struct *ts
__attribute__((regparm(3)))
void do_syscall_trace(struct pt_regs *regs, int entryexit)
{
+ /* do the secure computing check first */
+ if (unlikely(test_thread_flag(TIF_SECCOMP)))
+ secure_computing(regs->orig_eax);
+
if (unlikely(current->audit_context)) {
if (!entryexit)
audit_syscall_entry(current, regs->orig_eax,
--- xxx/arch/x86_64/ia32/ia32entry.S 2005-01-21 09:14:54.000000000 +0100
+++ xx/arch/x86_64/ia32/ia32entry.S 2005-01-21 09:07:57.000000000 +0100
@@ -78,7 +78,7 @@ ENTRY(ia32_sysenter_target)
.quad 1b,ia32_badarg
.previous
GET_THREAD_INFO(%r10)
- testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),threadinfo_flags(%r10)
+ testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),threadinfo_flags(%r10)
jnz sysenter_tracesys
sysenter_do_call:
cmpl $(IA32_NR_syscalls),%eax
@@ -163,7 +163,7 @@ ENTRY(ia32_cstar_target)
.quad 1b,ia32_badarg
.previous
GET_THREAD_INFO(%r10)
- testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),threadinfo_flags(%r10)
+ testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),threadinfo_flags(%r10)
jnz cstar_tracesys
cstar_do_call:
cmpl $IA32_NR_syscalls,%eax
@@ -236,7 +236,7 @@ ENTRY(ia32_syscall)
this could be a problem. */
SAVE_ARGS 0,0,1
GET_THREAD_INFO(%r10)
- testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),threadinfo_flags(%r10)
+ testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),threadinfo_flags(%r10)
jnz ia32_tracesys
ia32_do_syscall:
cmpl $(IA32_NR_syscalls),%eax
--- xxx/arch/x86_64/Kconfig 2005-01-15 20:44:50.000000000 +0100
+++ xx/arch/x86_64/Kconfig 2005-01-21 09:07:57.000000000 +0100
@@ -87,6 +87,10 @@ config GENERIC_IOMAP
bool
default y

+config SECCOMP
+ bool
+ default y
+
source "init/Kconfig"


--- xxx/arch/x86_64/kernel/entry.S 2005-01-15 20:44:50.000000000 +0100
+++ xx/arch/x86_64/kernel/entry.S 2005-01-21 09:07:57.000000000 +0100
@@ -185,7 +185,7 @@ ENTRY(system_call)
movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
movq %rcx,RIP-ARGOFFSET(%rsp)
GET_THREAD_INFO(%rcx)
- testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),threadinfo_flags(%rcx)
+ testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),threadinfo_flags(%rcx)
jnz tracesys
cmpq $__NR_syscall_max,%rax
ja badsys
--- xxx/arch/x86_64/kernel/ptrace.c 2005-01-04 01:13:11.000000000 +0100
+++ xx/arch/x86_64/kernel/ptrace.c 2005-01-21 09:07:57.000000000 +0100
@@ -17,6 +17,7 @@
#include <linux/user.h>
#include <linux/security.h>
#include <linux/audit.h>
+#include <linux/seccomp.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -521,6 +522,10 @@ static void syscall_trace(struct pt_regs

asmlinkage void syscall_trace_enter(struct pt_regs *regs)
{
+ /* do the secure computing check first */
+ if (unlikely(test_thread_flag(TIF_SECCOMP)))
+ secure_computing(regs->orig_rax);
+
if (unlikely(current->audit_context))
audit_syscall_entry(current, regs->orig_rax,
regs->rdi, regs->rsi,
--- xxx/fs/proc/base.c 2005-01-15 20:44:58.000000000 +0100
+++ xx/fs/proc/base.c 2005-01-21 09:07:57.000000000 +0100
@@ -32,6 +32,9 @@
#include <linux/mount.h>
#include <linux/security.h>
#include <linux/ptrace.h>
+#ifdef CONFIG_SECCOMP
+#include <linux/seccomp.h>
+#endif
#include "internal.h"

/*
@@ -49,6 +52,9 @@ enum pid_directory_inos {
PROC_TGID_TASK,
PROC_TGID_STATUS,
PROC_TGID_MEM,
+#ifdef CONFIG_SECCOMP
+ PROC_TGID_SECCOMP,
+#endif
PROC_TGID_CWD,
PROC_TGID_ROOT,
PROC_TGID_EXE,
@@ -75,6 +81,9 @@ enum pid_directory_inos {
PROC_TID_INO,
PROC_TID_STATUS,
PROC_TID_MEM,
+#ifdef CONFIG_SECCOMP
+ PROC_TID_SECCOMP,
+#endif
PROC_TID_CWD,
PROC_TID_ROOT,
PROC_TID_EXE,
@@ -120,6 +129,9 @@ static struct pid_entry tgid_base_stuff[
E(PROC_TGID_STATM, "statm", S_IFREG|S_IRUGO),
E(PROC_TGID_MAPS, "maps", S_IFREG|S_IRUGO),
E(PROC_TGID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
+#ifdef CONFIG_SECCOMP
+ E(PROC_TGID_SECCOMP, "seccomp", S_IFREG|S_IRUSR|S_IWUSR),
+#endif
E(PROC_TGID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
E(PROC_TGID_ROOT, "root", S_IFLNK|S_IRWXUGO),
E(PROC_TGID_EXE, "exe", S_IFLNK|S_IRWXUGO),
@@ -145,6 +157,9 @@ static struct pid_entry tid_base_stuff[]
E(PROC_TID_STATM, "statm", S_IFREG|S_IRUGO),
E(PROC_TID_MAPS, "maps", S_IFREG|S_IRUGO),
E(PROC_TID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
+#ifdef CONFIG_SECCOMP
+ E(PROC_TID_SECCOMP, "seccomp", S_IFREG|S_IRUSR|S_IWUSR),
+#endif
E(PROC_TID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
E(PROC_TID_ROOT, "root", S_IFLNK|S_IRWXUGO),
E(PROC_TID_EXE, "exe", S_IFLNK|S_IRWXUGO),
@@ -661,6 +676,60 @@ static struct inode_operations proc_mem_
.permission = proc_permission,
};

+#ifdef CONFIG_SECCOMP
+static ssize_t seccomp_read(struct file * file, char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct * tsk = proc_task(file->f_dentry->d_inode);
+ char __buf[20];
+ loff_t __ppos = *ppos;
+ size_t len;
+
+ len = sprintf(__buf, "%u\n", tsk->seccomp_mode) + 1;
+ if (__ppos >= len)
+ return 0;
+ if (count > len-__ppos)
+ count = len-__ppos;
+ if (copy_to_user(buf, __buf + __ppos, count))
+ return -EFAULT;
+ *ppos += count;
+ return count;
+}
+
+static ssize_t seccomp_write(struct file * file, const char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct * tsk = proc_task(file->f_dentry->d_inode);
+ char __buf[20], * end;
+ unsigned int seccomp_mode;
+
+ /* can set it only once to be even more secure */
+ if (unlikely(tsk->seccomp_mode))
+ return -EPERM;
+
+ memset(__buf, 0, 20);
+ if (count > 19)
+ count = 19;
+ if (copy_from_user(__buf, buf, count))
+ return -EFAULT;
+ seccomp_mode = simple_strtoul(__buf, &end, 0);
+ if (*end == '\n')
+ end++;
+ if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
+ tsk->seccomp_mode = seccomp_mode;
+ set_tsk_thread_flag(tsk, TIF_SECCOMP);
+ }
+ if (unlikely(!(end - __buf)))
+ return -EIO;
+ return end - __buf;
+}
+
+static struct file_operations proc_seccomp_operations = {
+ .read = seccomp_read,
+ .write = seccomp_write,
+};
+#endif /* CONFIG_SECCOMP */
+
static int proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
{
struct inode *inode = dentry->d_inode;
@@ -1296,6 +1365,12 @@ static struct dentry *proc_pident_lookup
inode->i_op = &proc_mem_inode_operations;
inode->i_fop = &proc_mem_operations;
break;
+#ifdef CONFIG_SECCOMP
+ case PROC_TID_SECCOMP:
+ case PROC_TGID_SECCOMP:
+ inode->i_fop = &proc_seccomp_operations;
+ break;
+#endif /* CONFIG_SECCOMP */
case PROC_TID_MOUNTS:
case PROC_TGID_MOUNTS:
inode->i_fop = &proc_mounts_operations;
--- xxx/include/asm-i386/thread_info.h 2005-01-04 01:13:27.000000000 +0100
+++ xx/include/asm-i386/thread_info.h 2005-01-21 09:07:57.000000000 +0100
@@ -140,6 +140,7 @@ register unsigned long current_stack_poi
#define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */
#define TIF_IRET 5 /* return with iret */
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
+#define TIF_SECCOMP 8 /* secure computing */
#define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
@@ -149,12 +150,14 @@ register unsigned long current_stack_poi
#define _TIF_SINGLESTEP (1<<TIF_SINGLESTEP)
#define _TIF_IRET (1<<TIF_IRET)
#define _TIF_SYSCALL_AUDIT (1<<TIF_SYSCALL_AUDIT)
+#define _TIF_SECCOMP (1<<TIF_SECCOMP)
#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)

/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK \
- (0x0000FFFF & ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP))
-#define _TIF_ALLWORK_MASK 0x0000FFFF /* work to do on any return to u-space */
+ (0x0000FFFF & ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP|_TIF_SECCOMP))
+/* work to do on any return to u-space */
+#define _TIF_ALLWORK_MASK (0x0000FFFF & ~_TIF_SECCOMP)

/*
* Thread-synchronous status.
--- xxx/include/asm-x86_64/thread_info.h 2005-01-04 01:13:29.000000000 +0100
+++ xx/include/asm-x86_64/thread_info.h 2005-01-21 09:07:57.000000000 +0100
@@ -102,6 +102,7 @@ static inline struct thread_info *stack_
#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
#define TIF_IRET 5 /* force IRET */
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
+#define TIF_SECCOMP 8 /* secure computing */
#define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
@@ -114,6 +115,7 @@ static inline struct thread_info *stack_
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
#define _TIF_IRET (1<<TIF_IRET)
#define _TIF_SYSCALL_AUDIT (1<<TIF_SYSCALL_AUDIT)
+#define _TIF_SECCOMP (1<<TIF_SECCOMP)
#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
#define _TIF_IA32 (1<<TIF_IA32)
#define _TIF_FORK (1<<TIF_FORK)
@@ -121,9 +123,9 @@ static inline struct thread_info *stack_

/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK \
- (0x0000FFFF & ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP))
+ (0x0000FFFF & ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP|_TIF_SECCOMP))
/* work to do on any return to user space */
-#define _TIF_ALLWORK_MASK 0x0000FFFF
+#define _TIF_ALLWORK_MASK (0x0000FFFF & ~_TIF_SECCOMP)

#define PREEMPT_ACTIVE 0x10000000

--- xxx/include/linux/sched.h 2005-01-21 09:14:55.000000000 +0100
+++ xx/include/linux/sched.h 2005-01-21 09:07:57.000000000 +0100
@@ -643,6 +643,7 @@ struct task_struct {

void *security;
struct audit_context *audit_context;
+ unsigned int seccomp_mode;

/* Thread group tracking */
u32 parent_exec_id;
--- xxx/include/linux/seccomp.h 1970-01-01 01:00:00.000000000 +0100
+++ xx/include/linux/seccomp.h 2005-01-21 09:07:57.000000000 +0100
@@ -0,0 +1,8 @@
+#ifndef _LINUX_SECCOMP_H
+#define _LINUX_SECCOMP_H
+
+#define NR_SECCOMP_MODES 1
+
+extern void secure_computing(int);
+
+#endif /* _LINUX_SECCOMP_H */
--- xxx/kernel/Makefile 2005-01-04 01:13:30.000000000 +0100
+++ xx/kernel/Makefile 2005-01-21 09:07:57.000000000 +0100
@@ -7,7 +7,7 @@ obj-y = sched.o fork.o exec_domain.o
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o intermodule.o extable.o params.o posix-timers.o \
- kthread.o wait.o kfifo.o sys_ni.o
+ kthread.o wait.o kfifo.o sys_ni.o seccomp.o

obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
--- xxx/kernel/seccomp.c 1970-01-01 01:00:00.000000000 +0100
+++ xx/kernel/seccomp.c 2005-01-21 09:07:57.000000000 +0100
@@ -0,0 +1,74 @@
+/*
+ * linux/kernel/seccomp.c
+ *
+ * Copyright 2004-2005 Andrea Arcangeli <[email protected]>
+ *
+ * This defines a simple but solid secure-computing mode.
+ */
+
+#include <linux/seccomp.h>
+#include <linux/sched.h>
+#include <asm/unistd.h>
+#ifdef TIF_IA32
+#include <asm/ia32_unistd.h>
+#endif
+
+/* #define SECCOMP_DEBUG 1 */
+
+/*
+ * Secure computing mode 1 allows only read/write/exit/sigreturn.
+ * To be fully secure this must be combined with rlimit
+ * to limit the stack allocations too.
+ */
+static int mode1_syscalls[] = {
+ __NR_read, __NR_write, __NR_exit,
+ /*
+ * Allow either sigreturn or rt_sigreturn, newer archs
+ * like x86-64 only defines __NR_rt_sigreturn.
+ */
+#ifdef __NR_sigreturn
+ __NR_sigreturn,
+#else
+ __NR_rt_sigreturn,
+#endif
+ 0, /* null terminated */
+};
+
+#ifdef TIF_IA32
+static int mode1_syscalls_32bit[] = {
+ __NR_ia32_read, __NR_ia32_write, __NR_ia32_exit,
+ /*
+ * Allow either sigreturn or rt_sigreturn, newer archs
+ * like x86-64 only defines __NR_rt_sigreturn.
+ */
+ __NR_ia32_sigreturn,
+ 0, /* null terminated */
+};
+#endif
+
+void secure_computing(int this_syscall)
+{
+ int mode = current->seccomp_mode;
+ int * syscall;
+
+ switch (mode) {
+ case 1:
+ syscall = mode1_syscalls;
+#ifdef TIF_IA32
+ if (test_thread_flag(TIF_IA32))
+ syscall = mode1_syscalls_32bit;
+#endif
+ do {
+ if (*syscall == this_syscall)
+ return;
+ } while (*++syscall);
+ break;
+ default:
+ BUG();
+ }
+
+#ifdef SECCOMP_DEBUG
+ dump_stack();
+#endif
+ do_exit(SIGKILL);
+}
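
[Editor's sketch, not part of the original mail.] The core of
secure_computing() in the patch above is a scan over a zero-terminated
allowlist. The userspace model below mirrors that loop; the syscall numbers
are illustrative i386 values written out by hand (the kernel takes them
from <asm/unistd.h>), and syscall_allowed is a hypothetical name. Where the
kernel would call do_exit(SIGKILL), the model just returns 0.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative i386 syscall numbers, hard-coded for this sketch only. */
enum { NR_EXIT = 1, NR_READ = 3, NR_WRITE = 4, NR_OPEN = 5, NR_SIGRETURN = 119 };

/* Zero-terminated allowlist, modelled on mode1_syscalls[] in the patch. */
const int mode1_allowed[] = { NR_READ, NR_WRITE, NR_EXIT, NR_SIGRETURN, 0 };

/*
 * Returns 1 if nr is on the list, 0 otherwise.
 * Like the patch, this uses do/while: the first entry is compared before
 * the terminator test, so a list whose first entry is syscall number 0
 * (as __NR_read is on x86-64) still matches that syscall.
 */
int syscall_allowed(const int *list, int nr)
{
    assert(list != NULL);
    do {
        if (*list == nr)
            return 1;
    } while (*++list);
    return 0;
}
```

With these values, syscall_allowed(mode1_allowed, NR_WRITE) yields 1 and
syscall_allowed(mode1_allowed, NR_OPEN) yields 0.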

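[Editor's sketch, not part of the original mail.] The entry.S comment in
the patch says that TIF_SECCOMP, being bit 8, needs testw rather than
testb. The small C model below shows why: a byte-wide test only sees bits
0-7. The bit numbers for AUDIT and SECCOMP come from the patch;
TIF_SYSCALL_TRACE = 0 is an assumption taken from the i386 header of that
era, and the two function names are hypothetical.

```c
#include <stdint.h>

/* AUDIT (7) and SECCOMP (8) are from the patch; TRACE = 0 is assumed. */
enum { TIF_SYSCALL_TRACE = 0, TIF_SYSCALL_AUDIT = 7, TIF_SECCOMP = 8 };

#define ENTRY_MASK ((1u << TIF_SYSCALL_TRACE) | \
                    (1u << TIF_SYSCALL_AUDIT) | \
                    (1u << TIF_SECCOMP))

/* Model of a byte-wide test: only the low 8 bits take part. */
int testb_nonzero(uint32_t flags, uint32_t mask)
{
    return ((uint8_t)flags & (uint8_t)mask) != 0;
}

/* Model of a word-wide test: the low 16 bits take part. */
int testw_nonzero(uint32_t flags, uint32_t mask)
{
    return ((uint16_t)flags & (uint16_t)mask) != 0;
}
```

For a task whose only set flag is TIF_SECCOMP, the byte-wide model reports
zero (the syscall would bypass the check) while the word-wide model reports
nonzero.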

2005-01-21 12:03:51

by Ingo Molnar

Subject: Re: seccomp for 2.6.11-rc1-bk8


* Andrea Arcangeli <[email protected]> wrote:

> This is the seccomp patch ported to 2.6.11-rc1-bk8, that I need for
> Cpushare (until trusted computing will hit the hardware market).
> [...]

why do you need any kernel code for this? This seems to be a limited
ptrace implementation: restricting untrusted userspace code to only be
able to exec read/write/sigreturn.

So this patch, unless i'm missing something, duplicates in essence what
ptrace can do already here and today, on any Linux box, on any CPU. You
can implement your client based on ptrace alone, just like UML does it -
and UML has much more complex needs than secure isolation.

ptrace ought to be perfectly fine for this, it traps every attempt to do
something privileged. [ptrace had its share of security problems but
_not_ many (if any at all) security problems that allowed a ptrace
client to _break out_ of a ptrace jail.]

Ingo

2005-01-21 12:11:40

by Pavel Machek

Subject: Re: seccomp for 2.6.11-rc1-bk8

Hi!

> This is the seccomp patch ported to 2.6.11-rc1-bk8, that I need for
> Cpushare (until trusted computing will hit the hardware market). This is
> against 2.6.11-rc1-bk8. The progress is on schedule so far, so it
> might

It needs entry in Documentation/ at least.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-21 12:48:07

by Ingo Molnar

Subject: Re: seccomp for 2.6.11-rc1-bk8


* Ingo Molnar <[email protected]> wrote:

> > This is the seccomp patch ported to 2.6.11-rc1-bk8, that I need for
> > Cpushare (until trusted computing will hit the hardware market).
> > [...]
>
> why do you need any kernel code for this? This seems to be a limited
> ptrace implementation: restricting untrusted userspace code to only be
> able to exec read/write/sigreturn.
>
> So this patch, unless i'm missing something, duplicates in essence what
> ptrace can do [...]

there's one thing ptrace wont do: if the ptrace parent dies unexpectedly
and the child was 'running' (there is a small window where the child
might not be stopped and where this may happen) then the child can get
runaway. While i think this is theoretical (UML doesnt suffer from this
problem), it is simple to fix - find below a proof-of-concept patch that
introduces PTRACE_ATTACH_JAIL - ptraced children can never escape out of
such a jail. (barely tested - but you get the idea.)

Ingo

Signed-off-by: Ingo Molnar <[email protected]>

--- kernel/ptrace.c.orig
+++ kernel/ptrace.c
@@ -49,10 +49,20 @@ void ptrace_untrace(task_t *child)
{
spin_lock(&child->sighand->siglock);
if (child->state == TASK_TRACED) {
- if (child->signal->flags & SIGNAL_STOP_STOPPED) {
+ /*
+ * Child must be killed if parent dies unexpectedly:
+ */
+ if (child->signal->flags & SIGNAL_PTRACE_ONLY) {
child->state = TASK_STOPPED;
- } else {
+ spin_unlock(&child->sighand->siglock);
+ force_sig_specific(SIGKILL, child);
signal_wake_up(child, 1);
+ } else {
+ if (child->signal->flags & SIGNAL_STOP_STOPPED) {
+ child->state = TASK_STOPPED;
+ } else {
+ signal_wake_up(child, 1);
+ }
}
}
spin_unlock(&child->sighand->siglock);
@@ -117,7 +127,7 @@ int ptrace_check_attach(struct task_stru
return ret;
}

-int ptrace_attach(struct task_struct *task)
+static int __ptrace_attach(struct task_struct *task, int jail)
{
int retval;
task_lock(task);
@@ -154,8 +164,12 @@ int ptrace_attach(struct task_struct *ta

write_lock_irq(&tasklist_lock);
__ptrace_link(task, current);
+ if (jail) {
+ spin_lock(&task->sighand->siglock);
+ task->signal->flags |= SIGNAL_PTRACE_ONLY;
+ spin_unlock(&task->sighand->siglock);
+ }
write_unlock_irq(&tasklist_lock);
-
force_sig_specific(SIGSTOP, task);
return 0;

@@ -164,6 +178,16 @@ bad:
return retval;
}

+int ptrace_attach(struct task_struct *task)
+{
+ return __ptrace_attach(task, 0);
+}
+
+int ptrace_attach_jail(struct task_struct *task)
+{
+ return __ptrace_attach(task, 1);
+}
+
int ptrace_detach(struct task_struct *child, unsigned int data)
{
if ((unsigned long) data > _NSIG)
--- arch/i386/kernel/ptrace.c.orig
+++ arch/i386/kernel/ptrace.c
@@ -388,6 +388,10 @@ asmlinkage int sys_ptrace(long request,
ret = ptrace_attach(child);
goto out_tsk;
}
+ if (request == PTRACE_ATTACH_JAIL) {
+ ret = ptrace_attach_jail(child);
+ goto out_tsk;
+ }

ret = ptrace_check_attach(child, request == PTRACE_KILL);
if (ret < 0)
--- include/linux/ptrace.h.orig
+++ include/linux/ptrace.h
@@ -18,6 +18,7 @@

#define PTRACE_ATTACH 0x10
#define PTRACE_DETACH 0x11
+#define PTRACE_ATTACH_JAIL 0x12

#define PTRACE_SYSCALL 24

@@ -79,6 +80,7 @@
extern int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len);
extern int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned long dst, int len);
extern int ptrace_attach(struct task_struct *tsk);
+extern int ptrace_attach_jail(struct task_struct *tsk);
extern int ptrace_detach(struct task_struct *, unsigned int);
extern void ptrace_disable(struct task_struct *);
extern int ptrace_check_attach(struct task_struct *task, int kill);
--- include/linux/sched.h.orig
+++ include/linux/sched.h
@@ -338,6 +338,7 @@ struct signal_struct {
#define SIGNAL_STOP_DEQUEUED 0x00000002 /* stop signal dequeued */
#define SIGNAL_STOP_CONTINUED 0x00000004 /* SIGCONT since WCONTINUED reap */
#define SIGNAL_GROUP_EXIT 0x00000008 /* group exit in progress */
+#define SIGNAL_PTRACE_ONLY 0x00000010 /* kill on ptrace parent death */


/*

2005-01-21 12:56:36

by Ingo Molnar

Subject: Re: seccomp for 2.6.11-rc1-bk8


* Ingo Molnar <[email protected]> wrote:

> > > This is the seccomp patch ported to 2.6.11-rc1-bk8, that I need for
> > > Cpushare (until trusted computing will hit the hardware market).
> > > [...]
> >
> > why do you need any kernel code for this? This seems to be a limited
> > ptrace implementation: restricting untrusted userspace code to only be
> > able to exec read/write/sigreturn.
> >
> > So this patch, unless i'm missing something, duplicates in essence what
> > ptrace can do [...]
>
> there's one thing ptrace wont do: if the ptrace parent dies
> unexpectedly and the child was 'running' (there is a small window
> where the child might not be stopped and where this may happen) then
> the child can get runaway. While i think this is theoretical (UML
> doesnt suffer from this problem), it is simple to fix - find below a
> proof-of-concept patch that introduces PTRACE_ATTACH_JAIL - ptraced
> children can never escape out of such a jail. (barely tested - but you
> get the idea.)

maybe this could even be fit into existing ptrace semantics, without any
need for PTRACE_ATTACH_JAIL. What we need is to catch the case where a
ptraced child is running (i.e. the signal_wake_up() has already been
done, and the parent is waiting for the child to stop again), and the
ptrace parent is killed unexpectedly. Would it be a correct fix to just
unconditionally stop the child in this case (and leave it hanging in
such a state)? Or to kill it right away?

Ingo

2005-01-21 17:39:22

by Chris Wright

Subject: Re: seccomp for 2.6.11-rc1-bk8

* Ingo Molnar ([email protected]) wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > This is the seccomp patch ported to 2.6.11-rc1-bk8, that I need for
> > Cpushare (until trusted computing will hit the hardware market).
> > [...]
>
> why do you need any kernel code for this? This seems to be a limited
> ptrace implementation: restricting untrusted userspace code to only be
> able to exec read/write/sigreturn.
>
> So this patch, unless i'm missing something, duplicates in essence what
> ptrace can do already here and today, on any Linux box, on any CPU. You
> can implement your client based on ptrace alone, just like UML does it -
> and UML has much more complex needs than secure isolation.

Only difference is in number of context switches, and number of running
processes (and perhaps ease of determining policy for which syscalls
are allowed). Although it's not really seccomp, it's just restricted
syscalls...

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

2005-01-21 18:39:27

by Rik van Riel

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Fri, 21 Jan 2005, Chris Wright wrote:
> * Ingo Molnar ([email protected]) wrote:

>> why do you need any kernel code for this? This seems to be a limited
>> ptrace implementation: restricting untrusted userspace code to only be
>> able to exec read/write/sigreturn.
>
> Only difference is in number of context switches, and number of running
> processes (and perhaps ease of determining policy for which syscalls
> are allowed). Although it's not really seccomp, it's just restricted
> syscalls...

Yes, but do you care about the performance of syscalls
which the program isn't allowed to call at all ? ;)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-01-21 18:50:11

by Chris Wright

Subject: Re: seccomp for 2.6.11-rc1-bk8

* Rik van Riel ([email protected]) wrote:
> Yes, but do you care about the performance of syscalls
> which the program isn't allowed to call at all ? ;)

Heh, no, but it's for every syscall not just denied ones. Point is
simply that ptrace (complexity aside) doesn't scale the same.

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

2005-01-21 19:01:11

by daw

Subject: Re: seccomp for 2.6.11-rc1-bk8

Chris Wright wrote:
>Only difference is in number of context switches, and number of running
>processes (and perhaps ease of determining policy for which syscalls
>are allowed). Although it's not really seccomp, it's just restricted
>syscalls...

There is a simple tweak to ptrace which fixes that: one could add an
API to specify a set of syscalls that ptrace should not trap on. To get
seccomp-like semantics, the user program could specify {read,write}, but
if the user program ever wants to change its policy, it could change that
set. Solaris /proc (which is what is used for tracing) has this feature.
I coded up such an extension to ptrace semantics a long time ago, and
it seemed to work fine for me, though of course I am not a ptrace expert.

I don't know whether ptrace + this tweak is a better idea than seccomp.
It is just another option out there that achieves similar functionality.
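
[Editor's sketch, not part of the original mail.] The "set of syscalls that
ptrace should not trap on" described above could be represented as a
per-tracee bitmap. The sketch below is purely illustrative: no such ptrace
API is shown in this thread, and struct notrap_set, MAX_SYSCALLS and the
function names are all hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MAX_SYSCALLS  512   /* hypothetical upper bound on syscall numbers */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-tracee set of syscalls let through without trapping. */
struct notrap_set {
    unsigned long bits[(MAX_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD];
};

/* Start with an empty set: every syscall traps. */
void notrap_clear(struct notrap_set *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof(*s));
}

/* Add nr to the set. Returns 0 on success, -1 if nr is out of range. */
int notrap_add(struct notrap_set *s, int nr)
{
    if (nr < 0 || nr >= MAX_SYSCALLS)
        return -1;
    s->bits[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
    return 0;
}

/* Returns 1 if the tracer should be notified for nr, 0 if nr is in the set. */
int should_trap(const struct notrap_set *s, int nr)
{
    if (nr < 0 || nr >= MAX_SYSCALLS)
        return 1;   /* unknown numbers always trap */
    return !(s->bits[nr / BITS_PER_WORD] & (1UL << (nr % BITS_PER_WORD)));
}
```

A tracer wanting seccomp-like behaviour would add only the numbers for read
and write and treat every trap as a violation; changing policy later means
adding or clearing bits.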

2005-01-21 19:18:23

by Chris Wright

Subject: Re: seccomp for 2.6.11-rc1-bk8

* David Wagner ([email protected]) wrote:
> There is a simple tweak to ptrace which fixes that: one could add an
> API to specify a set of syscalls that ptrace should not trap on. To get
> seccomp-like semantics, the user program could specify {read,write}, but
> if the user program ever wants to change its policy, it could change that
> set. Solaris /proc (which is what is used for tracing) has this feature.
> I coded up such an extension to ptrace semantics a long time ago, and
> it seemed to work fine for me, though of course I am not a ptrace expert.

Hmm, yeah, that'd be nice. That only leaves the issue of tracer dying
(say from that crazy oom killer ;-).

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

2005-01-21 19:55:40

by Ingo Molnar

Subject: Re: seccomp for 2.6.11-rc1-bk8


* Chris Wright <[email protected]> wrote:

> * Rik van Riel ([email protected]) wrote:
> > Yes, but do you care about the performance of syscalls
> > which the program isn't allowed to call at all ? ;)
>
> Heh, no, but it's for every syscall not just denied ones. Point is
> simply that ptrace (complexity aside) doesn't scale the same.

seccomp is about CPU-intense calculation jobs - the only syscalls
allowed are read/write (and sigreturn). UML implements a full kernel
via ptrace and CPU-intense applications run at native speed.

Ingo

2005-01-21 20:25:41

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Fri, Jan 21, 2005 at 01:47:01PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > > This is the seccomp patch ported to 2.6.11-rc1-bk8, that I need for
> > > Cpushare (until trusted computing will hit the hardware market).
> > > [...]
> >
> > why do you need any kernel code for this? This seems to be a limited
> > ptrace implementation: restricting untrusted userspace code to only be
> > able to exec read/write/sigreturn.
> >
> > So this patch, unless i'm missing something, duplicates in essence what
> > ptrace can do [...]
>
> there's one thing ptrace wont do: if the ptrace parent dies unexpectedly
> and the child was 'running' (there is a small window where the child

You got it, I couldn't use ptrace right now. Pavel already suggested it
and I told him about the problem with the parent being killed by oom.

> might not be stopped and where this may happen) then the child can get
> runaway. While i think this is theoretical (UML doesnt suffer from this
> problem), it is simple to fix - find below a proof-of-concept patch that
> introduces PTRACE_ATTACH_JAIL - ptraced children can never escape out of
> such a jail. (barely tested - but you get the idea.)

IMHO the complexity of ptrace makes it by definition less secure than
seccomp. Seccomp is extremely simple and self contained. This is why I
still prefer seccomp to fixing ptrace w.r.t. security.

Fixing ptrace w.r.t. security-tracing would still be nice, but I'd prefer
not to rely on ptrace when something as simple and robust as seccomp
can be implemented instead.

However if the kernel folks want me to use a "fixed version of
ptrace", I could use it too (performance isn't the issue). In _theory_
you're right: it'd be completely equivalent after fixing the problem with
the parent dying unexpectedly. But for my part, in practice I prefer to
rely _only_ on the much simpler seccomp patch (and on trusted computing
as soon as the hardware is available).

Even trusted computing will be less secure than seccomp from the point
of view of the seller (because it's a lot more complicated than
seccomp), but unlike with ptrace, the buyer will get both privacy
guarantees and guarantees about reliable results (only against
software attacks). Having those two guarantees for the buyer will be
fundamental, so it will be worth decreasing the seller's security a bit
to give these guarantees to the buyer. (I'll most certainly leave an
exchange for seccomp running at the same time I start the trusted
computing exchange, so if some sellers don't trust the trusted computing
code, they can stick with the very secure seccomp approach.) Right now,
though, seccomp seems the most secure solution from the seller's
standpoint, and the buyer won't notice the difference between ptrace and
seccomp.

Thanks.

2005-01-21 20:36:50

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Fri, Jan 21, 2005 at 08:55:22PM +0100, Ingo Molnar wrote:
>
> * Chris Wright <[email protected]> wrote:
>
> > * Rik van Riel ([email protected]) wrote:
> > > Yes, but do you care about the performance of syscalls
> > > which the program isn't allowed to call at all ? ;)
> >
> > Heh, no, but it's for every syscall not just denied ones. Point is
> > simply that ptrace (complexity aside) doesn't scale the same.
>
> seccomp is about CPU-intense calculation jobs - the only syscalls
> allowed are read/write (and sigreturn). UML implements a full kernel
> via ptrace and CPU-intense applications run at native speed.

Indeed. Performance is not an issue (in the short term at least, since
those syscalls will probably be network bound).

The only reason I couldn't use ptrace is what you found, that is the oom
killing of the parent (or a mistake of the CPU seller that kills it by
mistake by hand, I must prevent him from screwing himself ;). Even after
fixing ptrace, I have a hard time preferring ptrace, when a simple,
localized and self contained solution like seccomp is available.

The reason I called it seccomp and not restricted syscalls, is that I'm
not allowing Chris to choose which syscall to restrict. I restricted
only the ones that are required to be able to compute securely, hence
the name "seccomp" and not "restricted syscalls". Obviously I'm
restricting a certain number of syscalls to create this seccomp mode.

I'm open to different solutions, I can even live with you forcing me to
use the fixed version of ptrace, but you must be comfortable to take the
blame if it breaks ;). Personally I'm comfortable to take the blame only
if seccomp breaks, it's so simple that it can't break. And with break I
don't mean 0xf00f, that's a minor issue that will be autodetected by the
system. I mean breaking like killing the ptrace parent right now... That
can be fixed up reasonably securely too, but it _can't_ be autodetected
easily (I keep cross logs for everything so I can trace it, but it
won't be an immediate/automated task like the 0xf00f or fnclex).

2005-01-21 20:55:49

by Ingo Molnar

Subject: Re: seccomp for 2.6.11-rc1-bk8


* Andrea Arcangeli wrote:

> Indeed. Performance is not an issue (in the short term at least, since
> those syscalls will be probably network bound).
>
> The only reason I couldn't use ptrace is what you found, that is the
> oom killing of the parent (or a mistake of the CPU seller that kills
> it by mistake by hand, I must prevent him to screw himself ;). Even
> after fixing ptrace, I've an hard time to prefer ptrace, when a
> simple, localized and self contained solution like seccomp is
> available.

ptrace security problems are a matter of fact, but there are really two
types of security barriers related to ptrace:

- first there is the security barrier between a ptrace-ing task (the
parent) and the kernel/CPU state itself. Ptrace deals with lots of
complex x86 CPU details, which results in complex ptrace APIs and
semantics, plus ptrace itself is mainly used by two apps (gdb and
strace) which use it in a very controlled and synchronous way: all
this combined resulted in a fair share of bugs/races that allowed
unprivileged users to break into the kernel via abuse of the ptrace
APIs.

- the second barrier is the 'jail' of the ptraced task. Especially with
PTRACE_SYSCALL, the things a child ptraced process can do are
extremely limited, everything it tries to do will trap, the task will
suspend and the parent runs. The task is completely passive and ptrace
on that end is a pretty small engine that stops/traps/restarts user
processing without a lot of frills.

historically there have been a lot fewer problems with the second barrier.
(in fact i cannot remember even one security issue in that area.)

> I'm open to different solutions, I can even live with you forcing me
> to use the fixed version of ptrace, but you must be comfortable to
> take the blame if it breaks ;). [...]

i'm not forcing anyone to do anything, but i think the most logical
solution is to use ptrace. It's there on every Linux box so your client
can run even on 'older' Linux boxes. (You might want to detect in the
client whether the OOM race is fixed in a kernel, but it should not be a
truly big issue.) Waiting for any extra API to get significant userbase
takes at least 1-2 years - while ptrace is here and available on every
Linux box. If you require 'users' to go with a new (or worse: patched)
kernel then you are creating a pretty significant artificial market
penetration barrier for your application.

also, with more applications relying on ptrace it will become more
tested, more robust and people will do speedups. I think the fact that
UML uses ptrace is already a very good sign that it's robust for such
purposes. (_Also_, if there's a security problem in the ptrace barrier,
you'd like to know about it and fix it ASAP - SuSE ships UML, right?)

> [...] Personally I'm comfortable to take the blame only if seccomp
> breaks, it's so simple that it can't break. And with break I don't
> mean 0xf00f, that's a minor issue that will be autodetected by the
> system. I mean breaking like killing the ptrace parent right now...
> That can be fixed up reasonably securely too, but it _can't_ be
> autodetected easily (I keep cross logs for everything so I can trace
> it, but it won't be an immediate/automated task like the 0xf00f or
> fnclex).

an additional idea: you could waitpid() from a 'watchdog' task, on the
parent, and if the parent dies unexpectedly (you get a SIGCHLD) then you
could immediately kill the child PID and log the incident. This still
leaves a small window, but one that is probably quite hard to abuse,
especially since the watchdog task will very likely have the highest
priority amongst those tasks (due to sleeping almost always). The
watchdog task would naturally be very small and hence there's near zero
OOM risk for the watchdog task.
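
A minimal user-space sketch of the watchdog idea above. It assumes the
watchdog is the real parent of the supervising task (so waitpid() on it
works) and that the supervisor reports the worker's PID over a pipe.
`watchdog_run` and its `simulate_kill` flag are hypothetical names used
only for illustration, and the small window mentioned above (plus
possible PID reuse) is still there.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical illustration: the caller acts as the watchdog.
 * It forks a supervisor (standing in for the ptrace parent), which forks
 * a worker (standing in for the ptraced child) and reports its PID.
 * Returns 1 if the watchdog had to kill the worker, 0 if the supervisor
 * exited cleanly, -1 on error.
 */
static int watchdog_run(int simulate_kill)
{
	int fds[2], status;
	pid_t sup, worker = -1;

	if (pipe(fds) < 0)
		return -1;
	sup = fork();
	if (sup < 0)
		return -1;
	if (sup == 0) {
		pid_t w;

		close(fds[0]);
		w = fork();
		if (w < 0)
			_exit(1);
		if (w == 0) {
			close(fds[1]);
			for (;;)
				pause();	/* the "jailed" task */
		}
		if (write(fds[1], &w, sizeof(w)) != (ssize_t)sizeof(w)) {
			kill(w, SIGKILL);
			_exit(1);
		}
		close(fds[1]);
		if (simulate_kill)
			for (;;)
				pause();	/* wait to be killed from outside */
		kill(w, SIGKILL);		/* orderly shutdown */
		waitpid(w, NULL, 0);
		_exit(0);
	}
	close(fds[1]);
	if (read(fds[0], &worker, sizeof(worker)) != (ssize_t)sizeof(worker)) {
		close(fds[0]);
		waitpid(sup, NULL, 0);
		return -1;
	}
	close(fds[0]);
	if (simulate_kill)
		kill(sup, SIGKILL);	/* stands in for the OOM killer */
	if (waitpid(sup, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
		/* supervisor died unexpectedly: kill the worker and report it */
		kill(worker, SIGKILL);
		return 1;
	}
	return 0;
}
```

Calling `watchdog_run(1)` exercises the "parent killed" path and
`watchdog_run(0)` the orderly one; a real watchdog would also log the
incident at the point where it kills the worker.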

and yet another idea: if you make the ptrace task the leader of its own
process group, then IIRC you will get a death signal delivered to the
ptraced task if the parent dies prematurely. This should close the race
pretty well too.
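
The process-group behaviour recalled above is not verified here. A
related mechanism that does exist on Linux is prctl(PR_SET_PDEATHSIG),
which asks the kernel to signal a task when its parent dies. The sketch
below uses that instead (a swapped-in technique, not the process-group
approach); `pdeathsig_demo` is a hypothetical name.

```c
#include <poll.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: a middle process forks a child that arms
 * PR_SET_PDEATHSIG, then the middle process exits. Returns 1 if the
 * child is gone within timeout_ms, 0 if it is still running, -1 on error.
 */
static int pdeathsig_demo(int timeout_ms)
{
	int ready[2], alive[2], rc;
	struct pollfd pfd;
	pid_t mid;
	char c;

	if (pipe(ready) < 0 || pipe(alive) < 0)
		return -1;
	mid = fork();
	if (mid < 0)
		return -1;
	if (mid == 0) {
		pid_t child = fork();

		if (child < 0)
			_exit(1);
		if (child == 0) {
			close(ready[0]);
			close(alive[0]);
			/* ask for SIGKILL when the parent dies */
			if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
				_exit(1);
			if (write(ready[1], "r", 1) != 1)
				_exit(1);
			close(ready[1]);
			for (;;)
				pause();	/* keeps alive[1] open while running */
		}
		close(alive[0]);
		close(alive[1]);
		close(ready[1]);
		/* wait until the child has armed the signal, then die */
		if (read(ready[0], &c, 1) < 0)
			_exit(1);
		_exit(0);
	}
	close(ready[0]);
	close(ready[1]);
	close(alive[1]);
	waitpid(mid, NULL, 0);

	/* the read end reports hang-up once the child's write end is gone */
	pfd.fd = alive[0];
	pfd.events = POLLIN;
	rc = poll(&pfd, 1, timeout_ms);
	close(alive[0]);
	if (rc < 0)
		return -1;
	return rc > 0 ? 1 : 0;
}
```

The `ready` pipe closes the obvious race where the parent exits before
the child has called prctl().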

Ingo

2005-01-21 21:34:20

by Roland McGrath

Subject: Re: seccomp for 2.6.11-rc1-bk8

> maybe this could even be fit into existing ptrace semantics, without any
> need for PTRACE_ATTACH_JAIL. What we need is to catch the case where a
> ptraced child is running (i.e. the signal_wake_up() has already been
> done, and the parent is waiting for the child to stop again), and the
> ptrace parent is killed unexpectedly. Would it be a correct fix to just
> unconditionally stop the child in this case (and leave it hanging in
> such a state)? Or to kill it right away?

That's the same as the case of the debugger dying or being killed by hand.
When gdb has a bug, people want to be able to kill it and get on with using
their program, not have their program always be killed too.

If you add this feature, it makes most sense IMHO to use PTRACE_SETOPTIONS
as the way to request it.


Thanks,
Roland

2005-01-22 02:52:28

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Fri, Jan 21, 2005 at 09:54:16PM +0100, Ingo Molnar wrote:
> - the second barrier is the 'jail' of the ptraced task. Especially with
> PTRACE_SYSCALL, the things a child ptraced process can do are
> extremely limited, everything it tries to do will trap, the task will
> suspend and the parent runs. The task is completely passive and ptrace
> on that end is a pretty small engine that stops/traps/restarts user
> processing without alot of frills.
>
> historically there has been alot less problems with the second barrier.
> (in fact i cannot remember even one security issue in that area.)

I agree there are fewer problems in that area. But there's still a great
deal of complexity in ptrace that I preferred to keep out of the
security equation.

uml can't run with seccomp, uml is forced to ptrace, it has to trap the
arguments and everything.

Once kernel CVS comes back up, I'll get an email as soon as somebody
touches kernel/seccomp.c or the other files involved, and I can keep an
eye on the code and verify all modifications very quickly (plus there
will be very few modifications on those files, unlike for the ptrace
code that is much more under development). Keeping ptrace under control
would be more costly on my side.

> i'm not forcing anyone to do anything, but i think the most logical
> solution is to use ptrace. It's there on every Linux box so your client
> can run even on 'older' Linux boxes. (You might want to detect in the
> client whether the OOM race is fixed in a kernel, but it should not be a
> truly big issue.) Waiting for any extra API to get significant userbase
> takes at least 1-2 years - while ptrace is here and available on every

Note that I'm not ready for production myself yet, I'm suggesting to
include this now, exactly to get some real userbase ready in 1-2 years.
And after that with trusted computing it'll take another few years
before the trusted Cpushare exchange can start in parallel to the
seccomp one. My schedule is planned for a much longer timeframe, I
doubt anything significant could happen this year regardless of ptrace
or seccomp.

Plus I would never depend on the users to do the right thing (i.e. not
to run oom etc..). So I'm forced to wait the 1-2 years anyway, either to
get seccomp merged, or to get your ptrace extension merged. If I use
ptrace, the current kernels can't prevent the Cpushare users from hurting
themselves, so I won't allow current unpatched kernels to run.

I'm in no hurry, my first priority is to do everything safely, I don't care
to grow the userbase fast if I have to add some risk to the users to
do that.

Note also that all Cpushare client software that runs on the user
computers is GPL, in turn without pending patents and completely free
software, so you're very free to take it, rewrite it with ptrace, and
ship it to your users now. Even Microsoft can write its own Cpushare
client and ship it in Windows just fine. You can fake the kernel
version to tell the server 2.6.11+seccomp is running, despite 2.6.9 with
the insecure ptrace might be running instead (the Cpushare protocol does
most checks on the server side btw). I have no control on that and as
long as I have no liability I'm fine (and I write in capital letters no
liability and no warranty in the account creation procedure of course).
But the client I will ship myself on cpushare.com will have security as
priority number 1 in mind, and in turn I can't allow it to run with the
current ptrace kernel code.

(however if you want to write your own client for your own OS, please
let me know privately, instead of faking the kernel version, that's
going to be more secure should you need me to shut down just your clients
because you found a security issue in your code)

If you noticed, I also made sure that after seccomp is enabled, it is
impossible to disable it:

	/* can set it only once to be even more secure */
	if (unlikely(tsk->seccomp_mode))
		return -EPERM;

This is a *major* feature. I'm sure we can hack ptrace for that too with
yet another patch, but isn't it so much simpler to merge seccomp to get
the highest degree of security? The only way a user can screw himself
with seccomp is to write the right bit in /dev/mem at the right bit
offset. And I exclude that can happen by mistake. I mean, it has a
lower probability than a ram bitflip ;).
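
For readers without the patch at hand, a user-space model of that
set-once check might look like the following. Only the -EPERM test comes
from the snippet above; `struct task_model`, `seccomp_set_mode` and the
rule that mode 1 is the only valid value are assumptions made for
illustration, not the patch itself.

```c
#include <errno.h>

/* Hypothetical stand-in for the per-task field used by the patch. */
struct task_model {
	int seccomp_mode;	/* 0 = disabled, non-zero = enabled */
};

/*
 * Returns 0 on success, -EPERM if a mode was already set, and -EINVAL
 * for an unknown mode (the -EINVAL rule is an assumption).
 */
static int seccomp_set_mode(struct task_model *tsk, int mode)
{
	/* can set it only once to be even more secure */
	if (tsk->seccomp_mode)
		return -EPERM;
	if (mode != 1)
		return -EINVAL;
	tsk->seccomp_mode = mode;
	return 0;
}
```

Once the first call succeeds, every later call fails, including an
attempt to write 0 back, so the task cannot leave the mode again.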

> Linux box. If you require 'users' to go with a new (or worse: patched)
> kernel then you are creating a pretty significant artificial market
> penetration barrier for your application.

This is fine. It's a long term project, I don't care about the short
term, I only care that the users are as safe as possible.

> also, with more applications relying on ptrace it will become more
> tested, more robust and people will do speedups. I think the fact that
> UML uses ptrace is already a very good sign that it's robust for such
> purposes. (_Also_, if there's a security problem in the ptrace barrier,
> you'd like to know about it and fix it ASAP - SuSE ships UML, right?)

The implications of a UML bug are minor, compared to a seccomp bug that
affects Cpushare. To exploit a UML bug you've to find a uml system
first. To exploit a Cpushare bug you only need some money in a PayPal
account. I rate the two with different priority. This is why I made it
possible to abort all present and future transactions anytime anywhere
by clicking a button on my triband cellphone. A seccomp bug would block
the system until I figure out an upgrade path, so I really don't want
having to find seccomp bugs _ever_ ;). If there's a ptrace bug affecting
UML, let's fix it ASAP, but let's not stop Cpushare completely in the
meantime ;).

> an additional idea: you could waitpid() from a 'watchdog' task, on the
> parent, and if the parent dies unexpectedly (you get a SIGCHLD) then you
> could immediately kill the child PID and log the incident. This still
> leaves a small window, but one that is probably quite hard to abuse,
> especially since the watchdog task will very likely have the highest
> priority amongst those tasks (due to sleeping almost always). The
> watchdog task would naturally be very small and hence there's near zero
> OOM risk for the watchdog task.

Correct in theory, but in practice what you say is true only unless the
OOM killer is buggy like in 2.6 mainline incidentally and kills too many
tasks (including init, ask Thomas).

> and yet another idea: if you make the ptrace task the leader of its own
> process group, then IIRC you will get a death signal delivered to the
                      ^^^^
> ptraced task if the parent dies prematurely. This should close the race
> pretty well too.

... unless during 2.7 somebody changes the signal code and breaks it.

But it's really the "IIRC" above that makes me not feel safe. If you're
not sure to recall correctly it means it's not obvious stuff. Seccomp is
obvious, there's no way you could ever add "IIRC" to the seccomp
security.

I believe only obvious stuff can go right. Murphy probably disagrees
even with this, but I'm confident of being able to screw Murphy by
keeping things obviously safe ;). Murphy can't add bugs if there are no
bugs in the code. It's not always possible to keep things obviously safe
(UML can't), Cpushare kernel-side is one exception and I'd like to take
advantage of it.

How can you get it wrong calculating 1+1? Do you know anybody who gets
1+1=3 (eheh I'm kidding of course ;).

While if you compute something complex instead of 1+1, you'll start
seeing people get it wrong sometimes, including myself, and those are the
people that will keep hacking on the same piece of the kernel in the
future.

So since I have a special requirement that I can accomplish in an isolated
place with zero runtime overhead, out of the way of development, I prefer to
take advantage of it. This isn't possible with UML.

That said, I agree in theory ptrace (after fixing the killing of the
parent in kernel space with zero window, and possibly after adding a
feature like -EPERM above) would work equally well too. But it just
doesn't seem worth it.

Note that as somebody else already said on l-k some time ago, a full
virtualization technology (like hardware emulation) is much more secure
than UML, exactly because it has fewer special cases, despite being very
complex too. I don't rate UML highly secure myself (there was at least
an exploit posted on bugtraq showing how to escape the uml jail some
time ago).

Thanks for the help.

2005-01-22 03:25:22

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Fri, Jan 21, 2005 at 01:31:46PM -0800, Roland McGrath wrote:
> When gdb has a bug, people want to be able to kill it and get on with using
> their program, not have their program always be killed too.

What I need is that the program is killed right away synchronously as
soon as the "debugger" detaches (to me that's a needed feature). No
matter why the debugger detached. This is the opposite of what
ptrace/strace does right now.

Just try to attach to a task with strace -p, then kill strace with -9,
the task will keep going as if nothing had happened. I need the child
killed too instead (before the parent unptraces the child).

Probably the reason why the app gets killed in the gdb case is that the
ptrace task is the process leader of the process group like Ingo suggested. But
I'd rather not depend on leaders/groups/pids/signals, when I can do it
with do_exit and a check on the syscall number.
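
As a rough user-space model of "a check on the syscall number": the
thread names read, write and sigreturn as the allowed calls; exit is
added here on the assumption that the task must be able to terminate.
`check_syscall` and the `verdict` enum are hypothetical names, and in
the kernel the KILL case would correspond to the do_exit path.

```c
#include <stddef.h>
#include <sys/syscall.h>

enum verdict { VERDICT_ALLOW, VERDICT_KILL };

/* Allowed set, loosely modelled on what the thread describes. */
static const long allowed_syscalls[] = {
	SYS_read,
	SYS_write,
	SYS_exit,		/* assumption: the task may terminate itself */
	SYS_rt_sigreturn,
};

/* Decide from the syscall number alone; arguments are never looked at. */
static enum verdict check_syscall(long nr)
{
	size_t i;

	for (i = 0; i < sizeof(allowed_syscalls) / sizeof(allowed_syscalls[0]); i++)
		if (allowed_syscalls[i] == nr)
			return VERDICT_ALLOW;
	return VERDICT_KILL;
}
```

Nothing here depends on parent/child relationships, signals or process
groups, which is the property being argued for above.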

Ptrace does a lot more than what I need, I don't care about parameters or
anything more than the syscall number, I don't need to change the
retvals during syscall return or to check registers or to stop a task.
Even the auditing subsystem could be implemented by putting all tasks
under strace and by having the ptracers communicating with each other
with pipes to generate a global info. But it wouldn't be as reliable and
as simple as having kernel code doing it.

I'm still open to do it with ptrace if there's a consensus on l-k to do
it in that direction, it's probably going to work fine too but if I
didn't feel safer with seccomp I would be doing ptrace in the first
place, it's not like I've forgotten I could do it with ptrace too (like
Pavel already reminded me some months ago).

2005-01-22 10:33:07

by Pavel Machek

Subject: Re: seccomp for 2.6.11-rc1-bk8

Hi!

> > > > Yes, but do you care about the performance of syscalls
> > > > which the program isn't allowed to call at all ? ;)
> > >
> > > Heh, no, but it's for every syscall not just denied ones. Point is
> > > simply that ptrace (complexity aside) doesn't scale the same.
> >
> > seccomp is about CPU-intense calculation jobs - the only syscalls
> > allowed are read/write (and sigreturn). UML implements a full kernel
> > via ptrace and CPU-intense applications run at native speed.
>
> Indeed. Performance is not an issue (in the short term at least, since
> those syscalls will be probably network bound).
>
> The only reason I couldn't use ptrace is what you found, that is the oom
> killing of the parent (or a mistake of the CPU seller that kills it by
> mistake by hand, I must prevent him to screw himself ;). Even after
> fixing ptrace, I've an hard time to prefer ptrace, when a simple,
> localized and self contained solution like seccomp is available.

Well, seccomp is also getting very little testing, while ptrace gets a
lot of testing; I know that seccomp is simple, but I believe testing
coverage still makes ptrace a better choice.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-22 17:26:04

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Sat, Jan 22, 2005 at 11:32:42AM +0100, Pavel Machek wrote:
> Well, seccomp is also getting very little testing, when ptrace gets a
> lot of testing; I know that seccomp is simple, but I believe testing
> coverage still make ptrace better choice.

It's not testing that makes code more secure. Testing verifies the code
works in production, but testing almost never helps to find security
issues, and often not even hidden subtle race conditions. Check how many
security bugs have been found with testing. Just go to bugtraq and count
them. I simply cannot rely on testing for the security part. I will
rely on testing for everything else but not for this.

2005-01-22 19:43:00

by Pavel Machek

Subject: Re: seccomp for 2.6.11-rc1-bk8

Hi!

> > Well, seccomp is also getting very little testing, when ptrace gets a
> > lot of testing; I know that seccomp is simple, but I believe testing
> > coverage still make ptrace better choice.
>
> It's not testing that makes code more secure. Testing verifies the code
> works in production, but testing almost never helps to find security
> issues, and often not even hidden subtle race conditions. Check how many
> security bugs have been found with testing. Just go to bugtraq and count
> them. I simply cannot rely on testing for the security part. I will
> rely on testing for everything else but not for this.

Well, then you can help auditing ptrace()... It is probably also true
that more people audited ptrace() than seccomp :-).
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-22 23:34:22

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Sat, Jan 22, 2005 at 08:42:42PM +0100, Pavel Machek wrote:
> Well, then you can help auditing ptrace()... It is probably also true
> that more people audited ptrace() than seccomp :-).

Why should I spend time auditing ptrace when I have a superior solution
that doesn't require any auditing from me at all? I've a huge pile of work,
I'm not doing this for fun, just thinking of wasting time auditing a
single line of ptrace code is insane as far as I'm concerned (if I can
avoid it with a more robust, less likely to break and simpler approach).
If the l-k community forces me to use ptrace, I'll be forced to do that
indeed (and you should be ready to take the blame if something goes
wrong), but be sure I'll try as much as I can to stay away from ptrace
completely. ptrace is a debugging knob, uml itself is a debugging tool
that depends on a debugging knob and that's fine. I'm not doing a
debugging tool, I'm doing something that requires the maximum level of
security ever, and using ptrace is dead wrong for that IMHO.

2005-01-23 00:08:19

by Pavel Machek

Subject: Re: seccomp for 2.6.11-rc1-bk8

Hi!

> > Well, then you can help auditing ptrace()... It is probably also true
> > that more people audited ptrace() than seccomp :-).
>
> Why should I spend time auditing ptrace when I have a superior solution
> that doesn't require me any auditing at all? I've an huge pile of
> work,

Adding code is easy, but in the long term would lead to a maintenance
nightmare. Adding seccomp code that does a subset of ptrace, just
because a ptrace audit is a lot of work, seems like a wrong thing to
do. Sorry.

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-23 00:43:36

by Rik van Riel

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Sun, 23 Jan 2005, Andrea Arcangeli wrote:

> I'm doing something that requires the maximum level of
> security ever,

You're kidding, right ?

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-01-23 00:46:18

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Sun, Jan 23, 2005 at 01:07:04AM +0100, Pavel Machek wrote:
> Adding code is easy, but in the long term would lead to maintenance
> nightmare. Adding seccomp code that does subset of ptrace, just
> because ptrace audit is lot of work, seems like a wrong thing to
> do. Sorry.

Even if I do the ptrace audit right now, within 6 months something can
change and the implications of the changes won't be as trivial to
evaluate as if entry.S or seccomp.c have changed.

The userland side will be a lot more complicated to implement too.

Do you want compressed video streams to be played securely and
efficiently? I can't see a better solution than seccomp. ptrace would be
slower and it'd require ugly code to be written in userland. Streams
are going to pump some stuff into the pipes and this will avoid
quite a number of schedules per second (regardless of buffering). The
seccomp API is just tricky enough without having to hardcode into every
userland app the number of the syscalls. Seccomp at least gives a slight
chance to write arch independent code while still providing lowlevel
security from the OS, there's no way to use ptrace_syscall in an arch
independent manner.

In the last patch I sent privately to Andrew I made it a config option,
but I recommend not to disable it, or you won't be able to run the
Cpushare client. Andrew's right that seccomp.o would waste precious bytes
(not kbytes) on embedded systems, so it has to be a config option for
that. You can still modify it to use ptrace freely, but then I will have
nothing to do with the problems that may arise over time by using ptrace
within the GPL'd Cpushare client code and I personally do not approve
the use of ptrace there (but it's GPL so you can modify it). I'm doing
something that I can trust to run on my own desktop system, and
personally seccomp is the only thing I'm comfortable to depend on. Plus
the userland gets so much simpler as well. It's not only a problem of
trusting the kernel space of ptrace.

2005-01-23 00:52:16

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Sat, Jan 22, 2005 at 07:43:26PM -0500, Rik van Riel wrote:
> On Sun, 23 Jan 2005, Andrea Arcangeli wrote:
>
> >I'm doing something that requires the maximum level of
> >security ever,
>
> You're kidding, right ?

Why should I be kidding? The client code I'm doing has to be at least as secure
as ssh and the firewall code, what else has to be more secure than that?
Neither ssh nor the firewall code depends on ptrace for its security. The
nice thing is that I can embed all the security in the kernel with
seccomp, and I'd be a fool not to try to get it merged and instead
complicate my life with ptrace.

Once seccomp is in, I believe there's a chance that security people use
it for more than Cpushare while I don't think there's a chance you'll
see security people using ptrace_syscall hardcoding the syscall numbers
in every userland app out there that may have to parse untrusted data
with potentially buggy bytecode (i.e. decompression bytecode etc..).

2005-01-23 04:44:09

by Valdis Klētnieks

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Sun, 23 Jan 2005 01:52:13 +0100, Andrea Arcangeli said:

> Why should I be kidding? The client code I'm doing, has to be at least as secure

Maybe in your estimation it *has* to be that secure. However, actual experience
with other operating systems indicates that the mail programs and web browsers
have *higher* security requirements than ssh - because ssh can afford to trust
legitimate users, while MUAs and browsers have to protect the system against
actions taken by legitimate users.

> as ssh and the firewall code, what else has to be more secure than that?

Mail programs, web browsers, and I'm sure there's plenty of applications in
the various Three Letter Agencies that want even more security.

It's a poor idea to confuse "secure" with "can't break out of the sandbox".

> Nor ssh nor the firewall code depends on ptrace for their security. The

And they don't even depend on seccomp or ptrace for the security either...

> Once seccomp is in, I believe there's a chance that security people uses
> it for more than Cpushare while I don't think there's a chance you'll

Security people probably won't be interested, specifically because it's
way too inflexible. Very few real-life applications can be made to fit
into a "open all the files you might need, then shut yourself into a
read/write syscalls only" model.

In fact, a case could be made that the unnatural contortions needed to
restructure applications into a seccomp model actually *decrease* the
overall security, because of more complicated setup code being more
vulnerable to attack. Also, the fact that you need to keep open() out
of the permitted set for seccomp to make any sense means that you need to
open all the possible files up front. So now you're handing the program
*more* access to files than they should....

> see security people using ptrace_syscall hardcoding the syscall numbers

Oh, come *ON*, Andrea. This is a red herring and you *know* it. The only
people who will be hardcoding syscall numbers are the same idiots that
hardcoded capability masks instead of "#include <linux/capability.h>" and
using the CAP_* defines.

> in every userland app out there that may have to parse untrusted data
> with potentially buggy bytecode (i.e. decompression bytecode etc..).

And if a filename has a runtime dependency on the untrusted data (consider
any sort of web server or browser or mail program or anything else that
accepts a "suggested filename" as input), things get very difficult very quickly.

I can pass ptrace a SYSCALL_OPEN, and then call my untrusted code, and then
look at the filename at runtime and see if there's something hinky going on.
I can even apply heuristics like "The first file opened should be THIS one,
then THOSE 4 shared libraries in order, then THIS file, and then the NEXT file
is dependent on user input, but has to start with $USER/tmp/workdir, and then
there's two other opens of files X and Y, and then no others should happen".
Using seccomp, you don't get that choice. You either have to jump through
hoops to get all that set up beforehand, or allow open() in all its glory.
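
To make the heuristic above concrete, here is a sketch of just the
policy part, i.e. the decision a ptrace monitor could make each time the
traced task stops at an open. The ptrace plumbing is left out on
purpose, and `struct open_policy` and `open_policy_check` are
hypothetical names. The prefix test is a plain string comparison, so a
real monitor would also have to deal with "..", symlinks and the like.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical policy: some exact paths in order, then one prefixed path. */
struct open_policy {
	const char *const *fixed;	/* paths that must be opened first, in order */
	size_t nfixed;
	const char *prefix;		/* required prefix of the user-dependent file */
	size_t seen;			/* how many opens were accepted so far */
};

/* Returns 1 if this open fits the expected sequence, 0 if it looks hinky. */
static int open_policy_check(struct open_policy *p, const char *path)
{
	int ok;

	if (p->seen < p->nfixed)
		ok = strcmp(path, p->fixed[p->seen]) == 0;
	else if (p->seen == p->nfixed)
		ok = strncmp(path, p->prefix, strlen(p->prefix)) == 0;
	else
		ok = 0;			/* no further opens are expected */
	if (ok)
		p->seen++;
	return ok;
}
```

The monitor would call `open_policy_check` with the filename it read out
of the traced task and refuse the syscall (or kill the task) on 0.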


2005-01-23 06:12:09

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Sat, Jan 22, 2005 at 11:43:06PM -0500, Valdis Klētnieks wrote:
> It's a poor idea to confuse "secure" with "can't break out of the sandbox".

The only point I'm making with seccomp, is that if it can't break out of
the sandbox it's secure. I didn't mean that the only way to make it
secure is to put it in the sandbox of course.

> And they don't even depend on seccomp or ptrace for the security either...

Indeed.

> Security people probably won't be interested, specifically because it's
> way too inflexible. Very few real-life applications can be made to fit
> into a "open all the files you might need, then shut yourself into a
> read/write syscalls only" model.

This is exactly correct. Recycled matter is of lower quality. Not
everything is going to be printed on recycled paper, your vacation
photos cannot be printed on recycled paper. But a few may actually
appreciate recycled paper at a much cheaper price for an extremely tiny
niche of apps. It'll be a mess to be able to use it the first time, but
after they start using it they'll get a ton of it very cheap and it
might work as well as first quality paper for them. Perhaps somebody not
buying paper because it was too expensive may also start buying the
recycled paper because it gets affordable (yeah after the initial
dealing with the recycled matter conversion).

> In fact, a case could be made that the unnatural contortions needed to
> restructure applications into a seccomp model actually *decrease* the
> overall security, because of more complicated setup code being more
> vulnerable to attack. Also, the fact that you need to keep open() out

All setup code before the execve of the loader (and the loader is a few
lines of C only) is not in C/C++, which means first of all no buffer
overflows. It's a quite small piece of code as well. Sure there can
still be a bug there, but clearly some software must exist to start
the seccomp mode. But this software won't be the binfmt_elf.c and it
will not be written in C (which is also why using ptrace is way
annoying, since it'd require more C code), it'll be small, and it will
be written with security in mind. I've already uploaded that software to
the website if you want to check it (ignore the gui part, it's obsolete).

Just the fact it's not in C rules out 90% of possible exploits.

> of the permitted set for seccomp to make any sense means that you need to
> open all the possible files up front. So now you're handing the program
> *more* access to files than they should....

They're not files, they're pipes. There are only two open, fd 0 and fd 1
and no data emitted and received by those two pipes is being
computed outside seccomp. It's like if you push .mpeg data into fd 0 and
you read from fd 1 and you write it in the framebuffer. Even if
something goes wrong in the library, at worst you'll see garbage on
the screen.
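
As a rough illustration of the two-pipe model described above, here is a
minimal C sketch of a worker that touches nothing but read(2) and write(2)
on two descriptors. It is not the Cpushare loader; filter_loop(),
write_all() and transform() are hypothetical names, and transform() is a
trivial stand-in for real work such as a decoder.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the real work (e.g. a decoder): it only
 * flips the case bit of ASCII letters so the effect is visible. */
static void transform(unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if ((buf[i] >= 'a' && buf[i] <= 'z') ||
		    (buf[i] >= 'A' && buf[i] <= 'Z'))
			buf[i] ^= 0x20;
}

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const unsigned char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Copy in_fd to out_fd through transform() until EOF.
 * Only read(2) and write(2) are issued.
 * Returns 0 on EOF, -1 on error. */
int filter_loop(int in_fd, int out_fd)
{
	unsigned char buf[4096];

	for (;;) {
		ssize_t n = read(in_fd, buf, sizeof(buf));

		if (n == 0)
			return 0;
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		transform(buf, (size_t)n);
		if (write_all(out_fd, buf, (size_t)n) < 0)
			return -1;
	}
}
```

In the model above the call would be filter_loop(0, 1). Whatever the
transform does with hostile input, the process on the other end of fd 1
only ever receives bytes.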

I don't think a model like this can decrease security.

The last YOU update I did, fetched an update of some decoding library,
now if it was running under seccomp it couldn't do any damage. The same
is true for the zlib trouble some time ago.

I'm not suggesting everything should run inside seccomp, and of course
such an update would be happening anyway since not every app will run
under seccomp, but certainly if you've a _special_ critical app that you
don't want to risk to be exploited by a libz bug, then seccomp may help
and it's going to be a lot more handy to use than ptrace.

> Oh, come *ON*, Andrea. This is a red herring and you *know* it. The only
> people who will be hardcoding syscall numbers are the same idiots that
> hardcoded capability masks instead of "#include <linux/capability.h>" and
> using the CAP_* defines.

I didn't mean hardcoding in terms of numbers, I mean in terms of
__NR_read. Just read the 32bit emulation code, I had to use ifdef
TIF_IA32, that's the best I could do, and I doubt you would be able to
write much cleaner code in userland either.

> And if a filename has a runtime dependency on the untrusted data (consider
> any sort of web server or browser or mail program or anything else that
> accepts a "suggested filename" as input), things get very difficult very quickly.
>
> I can pass ptrace a SYSCALL_OPEN, and then call my untrusted code, and then
> look at the filename at runtime and see if there's something hinky going on.
> I can even apply heuristics like "The first file opened should be THIS one,
> then THOSE 4 shared libraries in order, then THIS file, and then the NEXT file
> is dependent on user input, but has to start with $USER/tmp/workdir, and then
> there's two other opens of files X and Y, and then no others should happen".
> Using seccomp, you don't get that choice. You either have to jump through
> hoops to get all that set up beforehand, or allow open() in all its glory.

I don't get what you mean here. Anyway the filedescriptors inside
seccomp are never going to be files, and there will be only two. I can
add some documentation if it gets merged.

But the point you're making that nobody is going to use seccomp except
me may very well be right, I'm not trying to disagree with that. I never
stated that others will certainly use it. I only said "...there's a
chance that...". So you may be very well right that nobody else will use
it ever.

We can backout seccomp 3 years in the future if I don't even need it
anymore myself. Stuff gets backed out all the time if nobody uses it and
seccomp is trivial to backout by grepping for the word seccomp.

Thanks for your feedback!

2005-01-23 07:36:28

by daw

Subject: Re: seccomp for 2.6.11-rc1-bk8

Chris Wright wrote:
>* David Wagner wrote:
>> There is a simple tweak to ptrace which fixes that: one could add an
>> API to specify a set of syscalls that ptrace should not trap on. To get
>> seccomp-like semantics, the user program could specify {read,write}, but
>> if the user program ever wants to change its policy, it could change that
>> set. Solaris /proc (which is what is used for tracing) has this feature.
>> I coded up such an extension to ptrace semantics a long time ago, and
>> it seemed to work fine for me, though of course I am not a ptrace expert.
>
>Hmm, yeah, that'd be nice. That only leaves the issue of tracer dying
>(say from that crazy oom killer ;-).

Yes, I also implemented a ptrace option which causes the child to be
slaughtered if the parent dies for any reason. I could dig up the code,
but I don't recall it being very hard. This was ages ago (a 2.0.x kernel)
and I have no idea what might have changed. Also, am definitely not a
guru on kernel internals, so it is always possible I missed something.
But, at least on the surface this doesn't seem hard to implement.

A third thing I implemented was an option which would cause ptrace() to be
inherited across forks. The way that strace does this (last I looked)
is an unreliable abomination: when it sees a request to call fork(), it
sets a breakpoint at the next instruction after the fork() by re-writing
the code of the parent, then when that breakpoint triggers it attaches to
the child, restores the parent's code, and lets them continue executing.
This is icky, and I have little confidence in its security to prevent
children from escaping a ptrace() jail, so I added a feature to ptrace()
that remedies the situation.

Anyway, back to the main topic: ptrace() vs seccomp. I think one
plausible reason to prefer some mechanism that allows user level to
specify the allowed syscall set is that it might provide more flexibility.
What if 6 months from now we discover that we really should have enabled
one more syscall in seccomp to accommodate other applications?

At the same time, I truly empathize with Andrea's position that something
like seccomp ought to be a lot easier to verify correct than ptrace().
I think several people here are underestimating the importance of
clean design. ptrace() is, frankly, a godawful mess, and I don't
know about this thinking that you can take a godawful mess and then
audit it carefully and call it secure -- well, that seems unlikely to
ever lead to the same level of assurance that you can get with a much
cleaner design. (This business of overloading as a means of sending
ptrace events to user level was in retrospect probably a bad design
decision, for instance. See, e.g., Section 12 of my MS thesis for more.
http://www.cs.berkeley.edu/~daw/papers/janus-masters.ps) Given this,
I can see real value in seccomp.

Perhaps there is a compromise position. What if one started from seccomp,
but then extended it so the set of allowed syscalls can be specified by
user level? This would push policy to user level, while retaining the
attractive simplicity and ease-of-audit properties of the seccomp design.
Does something like this make sense?

Let me give you some idea of new applications that might be enabled
by this kind of functionality. One cool idea is a 'delegating
architecture' for jails. The jailed process inherits an open file
descriptor to its jailor, and is only allowed to call read(), write(),
sendmsg(), and recvmsg(). If the jailed process wants to interact
with the outside world, it can send a request to its jailor to this
effect. For instance, suppose the jailed process wants to create a
file called "/tmp/whatever", so it sends this request to the jailor.
The jailor can decide whether it wants this to be allowed. If it is
to be allowed, the jailor can create this file and transfer a file
descriptor to the jailed process using sendmsg(). Note that this
mechanism allows the jailor to completely virtualize the system call
interface; for instance, the jailor could transparently instead create
"/tmp/jail17/whatever" and return a fd to it to the jailed process,
without the jailed process being any the wiser. (For more on this,
see http://www.stanford.edu/~talg/papers/NDSS04/abstract.html and
http://www.cs.jhu.edu/~seaborn/plash/plash.html)
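
To make the path virtualization step concrete, here is a small C sketch of
how a jailor might rewrite a requested "/tmp/..." name into a per-jail
directory before creating the file. It is only an illustration under
stated assumptions: remap_tmp_path() and has_dotdot() are hypothetical
names, the "/tmp/" prefix rule is invented for the example, and symlinks
are not handled at all.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if any '/'-separated component of p is exactly "..". */
static int has_dotdot(const char *p)
{
	while (*p) {
		const char *end = strchr(p, '/');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		if (len == 2 && p[0] == '.' && p[1] == '.')
			return 1;
		if (!end)
			break;
		p = end + 1;
	}
	return 0;
}

/* Map "/tmp/<rest>" to "<jail_dir>/<rest>".
 * Returns 0 on success, -1 if the request is refused or does not fit. */
int remap_tmp_path(const char *requested, const char *jail_dir,
		   char *out, size_t outlen)
{
	static const char prefix[] = "/tmp/";
	const size_t plen = sizeof(prefix) - 1;
	const char *rest;
	int n;

	if (strncmp(requested, prefix, plen) != 0)
		return -1;		/* only /tmp/ is virtualized here */
	rest = requested + plen;
	if (*rest == '\0' || has_dotdot(rest))
		return -1;		/* refuse empty names and ".." */

	n = snprintf(out, outlen, "%s/%s", jail_dir, rest);
	if (n < 0 || (size_t)n >= outlen)
		return -1;		/* result would be truncated */
	return 0;
}
```

With jail_dir set to "/tmp/jail17", a request for "/tmp/whatever" becomes
"/tmp/jail17/whatever", which matches the example in the text. The jailed
process never sees the rewritten name, only the descriptor it gets back.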

So this is one example of an application that is enabled by adding
recvmsg() to the set of allowed syscalls. When it comes to the broader
question of seccomp vs ptrace(), I don't know what strategy makes most
sense for the Linux kernel, but I hope these ideas help give you some
idea of what might be possible and how these mechanisms could be used.

2005-01-24 15:10:25

by Daniel Jacobowitz

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Sun, Jan 23, 2005 at 07:34:24AM +0000, David Wagner wrote:
> Chris Wright wrote:
> >* David Wagner wrote:
> >> There is a simple tweak to ptrace which fixes that: one could add an
> >> API to specify a set of syscalls that ptrace should not trap on. To get
> >> seccomp-like semantics, the user program could specify {read,write}, but
> >> if the user program ever wants to change its policy, it could change that
> >> set. Solaris /proc (which is what is used for tracing) has this feature.
> >> I coded up such an extension to ptrace semantics a long time ago, and
> >> it seemed to work fine for me, though of course I am not a ptrace expert.
> >
> >Hmm, yeah, that'd be nice. That only leaves the issue of tracer dying
> >(say from that crazy oom killer ;-).
>
> Yes, I also implemented a ptrace option which causes the child to be
> slaughtered if the parent dies for any reason. I could dig up the code,
> but I don't recall it being very hard. This was ages ago (a 2.0.x kernel)
> and I have no idea what might have changed. Also, am definitely not a
> guru on kernel internals, so it is always possible I missed something.
> But, at least on the surface this doesn't seem hard to implement.

Maybe it's time to resubmit both of these. OTOH, maybe it's time to do
something more drastic to ptrace to untangle it from signals...

> A third thing I implemented was an option which would cause ptrace() to be
> inherited across forks. The way that strace does this (last I looked)
> is an unreliable abomination: when it sees a request to call fork(), it
> sets a breakpoint at the next instruction after the fork() by re-writing
> the code of the parent, then when that breakpoint triggers it attaches to
> the child, restores the parent's code, and lets them continue executing.
> This is icky, and I have little confidence in its security to prevent
> children from escaping a ptrace() jail, so I added a feature to ptrace()
> that remedies the situation.

This has since been done in 2.5.x; see PTRACE_EVENT_FORK. GDB even
uses it nowadays. I'm not sure if strace does.

--
Daniel Jacobowitz

2005-02-15 09:26:09

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc1-bk8

On Sun, Jan 23, 2005 at 07:34:24AM +0000, David Wagner wrote:
> What if 6 months from now we discover that we really should have enabled
> one more syscall in seccomp to accommodate other applications?

This is why there's a seccomp mode number, and you've to choose it, I
only implemented mode 0 so far.

If I need poll too, I'll add poll(2) to seccomp mode 1, with a few
bytes of code. Until I need poll there's no point for me to add it
and I don't plan to use it in the short term. In the short term my only
objective is to bootstrap the system with the minimum of functionality that
makes it useful and seccomp is the simplest and most reliable way to
make the sell client secure.

You can't start with the final thing, or you'll never have a chance to
bootstrap it. Plus if Cpushare doesn't work, there's no point to invest
so much work into it.

I'll use buffering in the twisted layer of the client, so that blocking
time will be reduced. So if you can compute asynchronously with the new
data coming, lack of select shouldn't be noticeable and I'm avoiding
everything that I can avoid.

Now I added a -EINVAL retval to the write too (previously you had to
read it back to see if the write went through, and that's what my client
code has been doing so far to verify the seccomp mode got enabled). It's
still a good idea to keep reading it back as a double check, but the
-EINVAL now makes it even simpler:

	if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
		tsk->seccomp.mode = seccomp_mode;
		set_tsk_thread_flag(tsk, TIF_SECCOMP);
	} else
		return -EINVAL;

x:~ # cat /proc/self/seccomp
0
x:~ # echo 2 > /proc/self/seccomp
x:~ # echo $?
1
x:~ # echo 1 > /proc/self/seccomp
Connection to x closed.


> At the same time, I truly empathize with Andrea's position that something
> like seccomp ought to be a lot easier to verify correct than ptrace().
> I think several people here are underestimating the importance of
> clean design. ptrace() is, frankly, a godawful mess, and I don't
> know about this thinking that you can take a godawful mess and then
> audit it carefully and call it secure -- well, that seems unlikely to
> ever lead to the same level of assurance that you can get with a much
> cleaner design. (This business of overloading as a means of sending

Yep.

I strongly prefer to leave ptrace for debugging purposes and not to
depend on ptrace for production usage.

Think how much time I spent implementing seccomp and adapting the sell
client to use it, how much code I've written, and then compare that to a
ptrace (or Xen) solution.

I'll switch to a hypervisor only when there is an advantage for me to
do so (i.e. after trusted computing hits the hardware market).
Nothing will change for the buyers, I'll try as hard as possible to
allow the same buy client to buy trusted computing clients and seccomp
clients without even being able to notice the difference (except for the
price that I expect will be higher for trusted computing clients for
obvious reasons).

You can run the sell client in a virtualized environment already with
seccomp if you want to stack more jails one on top of the other (even in
virtualized linux environment from other OS) but you've to trust the
virtualization technology if you want to do that. I've currently no
resource to guarantee anything more than seccomp on top of bare hardware
will be rock solid.

Renzo (my ex prof of Operating Systems at University) is currently
working on a generic syscall trapping layer using ptrace for his
research, and he was independently asking me over the weekend why he has
to duplicate all those syscall numbers for every arch (he's doing only
x86 and ppc for a start), and I told him there's no way around it with
ptrace, and that's one of the reasons using seccomp is so much
easier for me. Renzo doesn't require the level of security that I
require with Cpushare and he requires much more flexibility than I do,
so for him ptrace is the simplest (just like uml).

There's no point for me to wait on all that work to be finished to
attempt to bootstrap Cpushare (the generic syscall trapping layer is
clearly a project in itself, on the same lines of uml) when what I want
to do is to define a "secure" model, that only allows message passing
through filedescriptors as the only means of interacting with other
processes and that solves issues with ptracer getting killed by oom or
stuff like that.

I promise that if Cpushare will go under (and it can really happen, no
way for me to know what will happen to it for sure), I'll spend the 5
minutes to drop seccomp from the kernel to remove unnecessary bloat ;).
Anybody should be able to drop seccomp from the kernel anytime (only
make sure that if you drop it, the /proc/<pid>/seccomp file will go away
too at the same time ;).

> Perhaps there is a compromise position. What if one started from seccomp,
> but then extended it so the set of allowed syscalls can be specified by
> user level? This would push policy to user level, while retaining the
> attractive simplicity and ease-of-audit properties of the seccomp design.
> Does something like this make sense?
>
> architecture' for jails. The jailed process inherits an open file
> descriptor to its jailor, and is only allowed to call read(), write(),
> sendmsg(), and recvmsg(). If the jailed process wants to interact

Why call sendmsg/recvmsg when you can call read/write anyway? Note
also that all networking for me will be handled by twisted, so there's no
point attaching raw sockets to the seccomp task; that would be a very bad
idea. I do both SSL encryption and huge buffering with multiplexing at the
twisted layer with the LGPL'd Cpushare protocol. The seccomp task only
sees demultiplexed cleartext instead.

All sockets will implement SSL with verification of the certificate to
be sure they connect to cpushare.cpushare.com for real (JP had to fix a
bug in core twisted to allow me to catch the exception if a man gets in
the middle, this is fixed in SVN and future 2.0 and that's the only
reason I don't allow 1.3.0), and I don't want to create unnecessary
sockets, nor to have the seccomp task handle SSL, to avoid worthless
overloading of my server epoll and SSL handshakes. Cpushare handles
everything with a single encrypted TCP connection per sell client.
Really you'll have to make two connections sometime so I can move you to
a Cpushare-gateway closer to your IP, but in the short term I've a
single server in Italy, so for now it's really only a single TCP
connection. In the long run you'll even be able to open new CPUs on
remote systems from a client bytecode so the workload can spread
exponentially (short term the protocol only allows the buyer client to
open new encrypted sockets). After I will allow that in the Cpushare
protocol, you won't even notice about all the
SSL/connect/accept/sendmsg/recvmsg details (I won't even deal with that
myself since twisted will do it for me without any C/java/C# annoyance).
I don't exclude in the long run I'll have to extend and rewrite it in
C++ for performance reasons, but for now it'd be just an unnecessary
delay (twisted saturates a 100mbit, and I won't have a 100mbit
connection to internet ;).

There's no need of allowing connect/accept/sendmsg/recvmsg, you've only
to implement a protocol like I'm doing.

The idea is to close the untrusted stuff in the smallest possible place,
and to validate everything it asks with full flexibility with twisted.
And all the verification stuff will be implemented with an interpreted
language and it won't be performance critical, so no buffer overflows
and the like.

The only thing I wonder about is poll(2): opening the pipes in
nonblocking mode is trivial for the seccomp loader, but without poll
it'll never be able to sleep if I set the pipe in nonblocking mode. But
I don't care right now, and I'll leave it in blocking mode (the
buffering will take care of it), but I don't exclude there will be a
seccomp mode 1 with poll added to the allowed syscall list (that takes
*30* seconds to implement).
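
The trade-off can be seen in a few lines of C. This is a minimal sketch,
not code from the patch or the loader; set_nonblocking() and read_once()
are hypothetical helper names. With O_NONBLOCK set, a read on an empty
pipe returns immediately with EAGAIN, so a task that cannot call poll(2)
has no way to sleep until data arrives and can only retry.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd in nonblocking mode. The loader would do this before seccomp
 * is enabled, since fcntl(2) is not on the allowed list discussed here.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* One read attempt, restarted only on EINTR.
 * Returns the byte count, 0 on EOF, -EAGAIN if no data is available
 * right now, or another negative errno on failure. */
ssize_t read_once(int fd, void *buf, size_t len)
{
	ssize_t n;

	do {
		n = read(fd, buf, len);
	} while (n < 0 && errno == EINTR);

	return n < 0 ? -errno : n;
}
```

A caller that gets -EAGAIN back has nothing to wait on without poll(2),
which is why leaving the pipes in blocking mode and buffering on the
outside is the simpler choice for now.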

> So this is one example of an application that is enabled by adding
> recvmsg() to the set of allowed syscalls. When it comes to the broader
> question of seccomp vs ptrace(), I don't know what strategy makes most
> sense for the Linux kernel, but I hope these ideas help give you some
> idea of what might be possible and how these mechanisms could be used.

The idea of seccomp is to close the task in a jail, and to do all other
risky operations with a task outside (LGPL'd sell client in my case with
Cpushare), the task outside will have to validate all the requests
coming from the untrusted bytecode, just like the kernel validates the
parameters of the syscalls, no difference. Everything you get from the
network will have to be validated anyway, so it's not big news. With
seccomp it's like if the seccomp task wasn't running on your computer
but on a remote server and in turn you can't trust the message it sends
to you. Except you won't need SSL between the sell client and the
seccomp task since nobody will be able to sniff the pipe ;)
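
As an illustration of what "validate all the requests" means on the
trusted side of the pipe, here is a C sketch of a bounds-checked frame
parser. Everything about the wire format is an assumption made for the
example (a 4-byte big-endian length followed by the payload, with an
arbitrary size cap); it is not the Cpushare protocol, and the thread says
the real validation is done in an interpreted language, not C.
parse_frame(), HEADER_LEN and MAX_PAYLOAD are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define HEADER_LEN  4u		/* assumed: 32-bit big-endian length */
#define MAX_PAYLOAD 65536u	/* assumed upper bound on one request */

/* Inspect `avail` bytes received from the untrusted task.
 * Returns  1 and sets *payload_len if a complete, sane frame is present,
 *          0 if more bytes are needed,
 *         -1 if the header is invalid and the peer should be dropped. */
int parse_frame(const unsigned char *buf, size_t avail, size_t *payload_len)
{
	uint32_t len;

	if (avail < HEADER_LEN)
		return 0;

	len = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	      ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];

	/* Never trust the length field: reject before allocating or copying. */
	if (len == 0 || len > MAX_PAYLOAD)
		return -1;

	if (avail - HEADER_LEN < len)
		return 0;

	*payload_len = len;
	return 1;
}
```

The point is the same one the kernel applies to syscall arguments: every
field is checked against a limit before it is used.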

All the sell client does is to provide an API between the buy client and
the untrusted bytecode that runs remotely. Both the buy client and the
sell clients will be library code for people to use.

Luckily to use it you won't have to understand a thing of the above ;)

2005-02-15 09:32:51

by Andrea Arcangeli

Subject: seccomp for 2.6.11-rc4

Hello,

This is the latest version against 2.6.11-rc4:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.11-rc4/seccomp

I'd need it merged into mainline at some point, unless anybody has
strong arguments against it. All I can guarantee here, is that I'll back
it out myself in the future, iff Cpushare will fail and nobody else
started using it in the meantime for similar security purposes.

Thanks.

2005-02-16 05:25:10

by Herbert Poetzl

Subject: Re: seccomp for 2.6.11-rc4

On Tue, Feb 15, 2005 at 10:32:44AM +0100, Andrea Arcangeli wrote:
> Hello,
>
> This is the latest version against 2.6.11-rc4:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.11-rc4/seccomp
>
> I'd need it merged into mainline at some point, unless anybody has
> strong arguments against it. All I can guarantee here, is that I'll back
> it out myself in the future, iff Cpushare will fail and nobody else
> started using it in the meantime for similar security purposes.

hmm, just an idea, but have you thought about using
an indirect syscall table for your purposes?

current->syscall_table

and have a table for every 'mode' you want to use ...

or maybe have a 'mask' for every syscall (in a
separate table) which describes the allowed 'modes'

just because checking the syscall number in a loop
doesn't look very scalable to me ...

best,
Herbert

> Thanks.

2005-02-18 02:26:04

by Andrea Arcangeli

Subject: Re: seccomp for 2.6.11-rc4

On Wed, Feb 16, 2005 at 06:25:03AM +0100, Herbert Poetzl wrote:
> hmm, just an idea, but have you thought about using
> an indirect syscall table for your purposes?
>
> current->syscall_table
>
> and have a table for every 'mode' you want to use ...

That would add an additional level of indirection for every syscall
(you'll have to potentially waste a cacheline to read the address of the
syscall table).

While my current approach is absolutely zero cost for the fast path on
x86-64 and it's a mere s/movb/movw/ for x86 (i.e. zero for x86 too).
Perhaps I could even get away with a movb on x86 but frankly I didn't
even try ;)

My priority has been not to change the fast path at all, and clearly I
have to add a bitflag to achieve that. And once I've the bitflag it's
not worth it anymore to change the syscall table, and I can validate the
syscall number right away (this avoids building arrays and other more
complex stuff).

> or maybe have a 'mask' for every syscall (in a
> separate table) which describes the allowed 'modes'
>
> just because checking the syscall number in a loop
> doesn't look very scalable to me ...

You're right about it being O(N) if you use it for all modes, but it's
really O(1) since it's being used for mode 0 only, and the number of
syscalls in mode 0 is fixed so it's O(1) and more important the number
is so small that it's really like O(1) in practice too (and not only in
math terms just because the number of syscalls in mode 0 is fixed ;).

Each mode can implement the mask as it wishes, so if you were to allow
hundreds of syscalls in mode 1 then you'd better implement the check as a
bitmask as you suggested and you can do that while implementing mode 1.

But seccomp isn't designed to allow a ton of syscalls, so there can be
tiny differences between mode 0/1/2 and they should all have very few
syscalls, so I doubt it'd be worth implementing the bitmask thingy right now.
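
For comparison, the two checks being discussed can be sketched side by
side in userland C. This is not the kernel code from the patch: the
allowed list below is illustrative only, MAX_SYSCALL is an assumed bound,
and all function names are hypothetical. The scan is a loop over a
handful of entries; the bitmask costs a table per mode but is a single
lookup no matter how many syscalls are allowed.

```c
#include <limits.h>
#include <stddef.h>
#include <sys/syscall.h>

#define MAX_SYSCALL   8192	/* assumed bound on syscall numbers */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Illustrative allowed list for one mode (not taken from the patch). */
static const long mode_allowed[] = { SYS_read, SYS_write, SYS_exit };
#define NR_ALLOWED (sizeof(mode_allowed) / sizeof(mode_allowed[0]))

/* Variant 1: linear scan, fine while the list stays tiny. */
int allowed_by_scan(long nr)
{
	for (size_t i = 0; i < NR_ALLOWED; i++)
		if (mode_allowed[i] == nr)
			return 1;
	return 0;
}

/* Variant 2: one bit per syscall number, built once per mode. */
static unsigned long mode_mask[MAX_SYSCALL / BITS_PER_WORD];

void build_mask(void)
{
	for (size_t i = 0; i < NR_ALLOWED; i++) {
		long nr = mode_allowed[i];

		if (nr >= 0 && nr < MAX_SYSCALL)
			mode_mask[(unsigned long)nr / BITS_PER_WORD] |=
				1UL << ((unsigned long)nr % BITS_PER_WORD);
	}
}

int allowed_by_mask(long nr)
{
	if (nr < 0 || nr >= MAX_SYSCALL)
		return 0;
	return (int)((mode_mask[(unsigned long)nr / BITS_PER_WORD] >>
		      ((unsigned long)nr % BITS_PER_WORD)) & 1UL);
}
```

With three entries the scan is at most three comparisons, so the bitmask
only starts to pay off for a mode that allows a large set.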

Thanks.

2005-02-25 19:02:29

by daw

Subject: Re: seccomp for 2.6.11-rc1-bk8

Andrea Arcangeli wrote:
>On Sun, Jan 23, 2005 at 07:34:24AM +0000, David Wagner wrote:
>> [...Ostia...] The jailed process inherits an open file
>> descriptor to its jailor, and is only allowed to call read(), write(),
>> sendmsg(), and recvmsg(). [...]
>
>Why to call sendmsg/recvmsg when you can call read/write anyway?

Because sendmsg() and recvmsg() allow passing of file descriptors,
and read() and write() do not. For some uses of this kind of jail,
the ability to pass file descriptors to/from your master is a big deal.
It enables significant new uses of seccomp. Right now, the only way a
master can get a fd to the jail is to inherit that fd across fork(),
but this isn't as flexible and it restricts the ability to pass fds
interactively.
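
For readers who have not used it, descriptor passing over a Unix domain
socket looks roughly like this. The sketch uses the standard SCM_RIGHTS
ancillary message with sendmsg(2) and recvmsg(2); send_fd() and recv_fd()
are hypothetical helper names and error handling is kept minimal.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send descriptor `fd` over the Unix socket `sock`, together with one
 * dummy data byte. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;	/* forces correct alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`.
 * Returns the new descriptor, or -1 if none arrived. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

A jail that is allowed recvmsg() could call recv_fd() on its inherited
socket whenever the master decides to grant it a new descriptor, which
read() and write() alone cannot express.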

Andrea, I understand that you don't have any use for sendmsg()/recvmsg()
in your Cpushare application. I'm thinking about this from the point of
view of other potential users of seccomp. I believe there are several
other applications which might benefit from seccomp, if only it were
to allow fd-passing. If we're going to deploy this in the mainstream
kernel, maybe it makes sense to enable other uses as well. And that's
why I suggested allowing sendmsg() and recvmsg().

It might be worth considering.

[Sorry for the very late reply; I've been occupied with other things
since your last reply.]