Message-ID: <54565DA4.50408@gmail.com>
Date: Sun, 02 Nov 2014 17:36:52 +0100
From: "Michael Kerrisk (man-pages)"
To: Kees Cook
CC: mtk.manpages@gmail.com, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] seccomp.2: document seccomp syscall

Hi Kees,

On 09/26/2014 12:47 AM, Kees Cook wrote:
> Combines documentation from prctl, in-kernel seccomp_filter.txt and
> dropper.c, along with details specific to the new syscall.

I am working on integrating this page at the moment, and I'll have an
edited draft for you sometime soon, I hope. In the meantime, I have a
question. Could you show a sample run of the example program given in
the man page? In my attempts so far, I always get EINVAL from
seccomp(). Obviously, I am missing something.
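As a side note on the jump offsets in the example filter: in classic
BPF, a conditional jump at instruction pc continues at pc + 1 + jt (or
pc + 1 + jf), and the kernel refuses programs whose jumps leave the
program or that do not end in a return. The following is only a rough
user-space sketch of that arithmetic, not the kernel's actual checker.
check_jumps() and example_filter_ok() are hypothetical helper names, and
AUDIT_ARCH_X86_64, syscall number 1, and errno 1 are placeholder values
chosen purely for illustration.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper (not part of any library): returns 0 if every jump
 * in the classic BPF program lands inside the program and the last
 * instruction is a return, or -1 otherwise. This approximates only a
 * small part of what the kernel verifies.
 */
static int check_jumps(const struct sock_filter *f, size_t len)
{
    size_t pc;

    if (len == 0)
        return -1;

    for (pc = 0; pc < len; pc++) {
        if (BPF_CLASS(f[pc].code) != BPF_JMP)
            continue;

        if (BPF_OP(f[pc].code) == BPF_JA) {
            /* Unconditional jump: the offset is held in k. */
            if (f[pc].k >= len - pc - 1)
                return -1;
        } else {
            /* Conditional jump: target is pc + 1 + jt (or jf). */
            if (pc + 1 + f[pc].jt >= len || pc + 1 + f[pc].jf >= len)
                return -1;
        }
    }

    /* The program must end in a return instruction. */
    return BPF_CLASS(f[len - 1].code) == BPF_RET ? 0 : -1;
}

/*
 * Build the same seven-instruction layout as the man page example, with
 * a caller-chosen "false" offset on the architecture test, and report
 * whether it passes check_jumps(). All constants are placeholders.
 */
static int example_filter_ok(unsigned char arch_jf)
{
    struct sock_filter filter[] = {
        /* 0: load architecture */
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
                 (offsetof(struct seccomp_data, arch))),
        /* 1: on mismatch, skip arch_jf instructions */
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 0, arch_jf),
        /* 2: load syscall number */
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
                 (offsetof(struct seccomp_data, nr))),
        /* 3: on mismatch, skip 1 instruction */
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 1, 0, 1),
        /* 4: matching arch and syscall */
        BPF_STMT(BPF_RET+BPF_K,
                 SECCOMP_RET_ERRNO|(1 & SECCOMP_RET_DATA)),
        /* 5: destination of syscall mismatch */
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        /* 6: destination of arch mismatch */
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    };

    return check_jumps(filter, sizeof(filter) / sizeof(filter[0]));
}
```

With the offset used in the example (4), the architecture test at
instruction 1 resolves to instruction 6, the final return, so
example_filter_ok(4) returns 0. An offset of 5 would point one past the
last instruction, and the helper returns -1.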

Cheers,

Michael

> Signed-off-by: Kees Cook
> ---
>  man2/seccomp.2 | 400 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 400 insertions(+)
>  create mode 100644 man2/seccomp.2
>
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> new file mode 100644
> index 0000000..f64950f
> --- /dev/null
> +++ b/man2/seccomp.2
> @@ -0,0 +1,400 @@
> +.\" Copyright (C) 2014 Kees Cook
> +.\" and Copyright (C) 2012 Will Drewry
> +.\" and Copyright (C) 2008 Michael Kerrisk
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +seccomp \-
> +operate on Secure Computing state of the process
> +.SH SYNOPSIS
> +.nf
> +.B #include
> +.B #include
> +.B #include
> +.B #include
> +.B #include
> +
> +.BI "int seccomp(unsigned int " operation ", unsigned int " flags ,
> +.BI " void *" args );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR seccomp ()
> +system call operates on the Secure Computing (seccomp) state of the
> +current process.
> +
> +Currently, Linux supports the following
> +.IR operation
> +values:
> +.TP
> +.BR SECCOMP_SET_MODE_STRICT
> +The only system calls that the thread is permitted to make are
> +.BR read (2),
> +.BR write (2),
> +.BR _exit (2),
> +and
> +.BR sigreturn (2).
> +Other system calls result in the delivery of a
> +.BR SIGKILL
> +signal. Strict secure computing mode is useful for number-crunching
> +applications that may need to execute untrusted byte code, perhaps
> +obtained by reading from a pipe or socket.
> +
> +This operation is available only if the kernel is configured with
> +.BR CONFIG_SECCOMP
> +enabled.
> +
> +The value of
> +.IR flags
> +must be 0, and
> +.IR args
> +must be NULL.
> +
> +This operation is functionally identical to calling
> +.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_STRICT)" .
> +.TP
> +.BR SECCOMP_SET_MODE_FILTER
> +The system calls allowed are defined by a pointer to a Berkeley Packet
> +Filter (BPF) passed via
> +.IR args .
> +This argument is a pointer to
> +.IR "struct\ sock_fprog" ;
> +it can be designed to filter arbitrary system calls and system call
> +arguments. If the filter is invalid, the call will fail, returning
> +.BR EACCES
> +in
> +.IR errno .
> +
> +If
> +.BR fork (2),
> +.BR clone (2),
> +or
> +.BR execve (2)
> +are allowed by the filter, any child processes will be constrained to
> +the same filters and system calls as the parent.
> +
> +Prior to using this operation, the process must call
> +.IR "prctl(PR_SET_NO_NEW_PRIVS,\ 1)"
> +or run with
> +.BR CAP_SYS_ADMIN
> +privileges in its namespace. If these are not true, the call will fail
> +and return
> +.BR EACCES
> +in
> +.IR errno .
> +This requirement ensures that filter programs cannot be applied to child
> +processes with greater privileges than the process that installed them.
> +
> +Additionally, if
> +.BR prctl (2)
> +or
> +.BR seccomp (2)
> +is allowed by the attached filter, additional filters may be layered on
> +which will increase evaluation time, but allow for further reduction of
> +the attack surface during execution of a process.
> +
> +This operation is available only if the kernel is configured with
> +.BR CONFIG_SECCOMP_FILTER
> +enabled.
> +
> +When
> +.IR flags
> +are 0, this operation is functionally identical to calling
> +.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_FILTER,\ args)" .
> +
> +The recognized
> +.IR flags
> +are:
> +.RS
> +.TP
> +.BR SECCOMP_FILTER_FLAG_TSYNC
> +When adding a new filter, synchronize all other threads of the current
> +process to the same seccomp filter tree. If any thread cannot do this,
> +the call will not attach the new seccomp filter, and will fail returning
> +the first thread ID found that cannot synchronize. Synchronization will
> +fail if another thread is in
> +.BR SECCOMP_MODE_STRICT
> +or if it has attached new seccomp filters to itself, diverging from the
> +calling thread's filter tree.
> +.RE
> +.SH FILTERS
> +When adding filters via
> +.BR SECCOMP_SET_MODE_FILTER ,
> +.IR args
> +points to a filter program:
> +
> +.in +4n
> +.nf
> +struct sock_fprog {
> +    unsigned short len; /* Number of BPF instructions */
> +    struct sock_filter *filter;
> +};
> +.fi
> +.in
> +
> +Each program must contain one or more BPF instructions:
> +
> +.in +4n
> +.nf
> +struct sock_filter { /* Filter block */
> +    __u16 code; /* Actual filter code */
> +    __u8 jt; /* Jump true */
> +    __u8 jf; /* Jump false */
> +    __u32 k; /* Generic multiuse field */
> +};
> +.fi
> +.in
> +
> +When executing the instructions, the BPF program executes over the
> +syscall information made available via:
> +
> +.in +4n
> +.nf
> +struct seccomp_data {
> +    int nr; /* system call number */
> +    __u32 arch; /* AUDIT_ARCH_* value */
> +    __u64 instruction_pointer; /* CPU instruction pointer */
> +    __u64 args[6]; /* up to 6 system call arguments */
> +};
> +.fi
> +.in
> +
> +A seccomp filter may return any of the following values. If multiple
> +filters exist, the return value for the evaluation of a given system
> +call will always use the highest precedent value. (For example,
> +.BR SECCOMP_RET_KILL
> +will always take precedence.)
> +
> +In precedence order, they are:
> +.TP
> +.BR SECCOMP_RET_KILL
> +Results in the task exiting immediately without executing the
> +system call. The exit status of the task (status & 0x7f) will
> +be
> +.BR SIGSYS ,
> +not
> +.BR SIGKILL .
> +.TP
> +.BR SECCOMP_RET_TRAP
> +Results in the kernel sending a
> +.BR SIGSYS
> +signal to the triggering task without executing the system call.
> +.IR siginfo\->si_call_addr
> +will show the address of the system call instruction, and
> +.IR siginfo\->si_syscall
> +and
> +.IR siginfo\->si_arch
> +will indicate which syscall was attempted. The program counter will be
> +as though the syscall happened (i.e. it will not point to the syscall
> +instruction). The return value register will contain an arch\-dependent
> +value; if resuming execution, set it to something sensible.
> +(The architecture dependency is because replacing it with
> +.BR ENOSYS
> +could overwrite some useful information.)
> +
> +The
> +.BR SECCOMP_RET_DATA
> +portion of the return value will be passed as
> +.IR si_errno .
> +
> +.BR SIGSYS
> +triggered by seccomp will have a
> +.IR si_code
> +of
> +.BR SYS_SECCOMP .
> +.TP
> +.BR SECCOMP_RET_ERRNO
> +Results in the lower 16-bits of the return value being passed
> +to userland as the
> +.IR errno
> +without executing the system call.
> +.TP
> +.BR SECCOMP_RET_TRACE
> +When returned, this value will cause the kernel to attempt to
> +notify a ptrace()-based tracer prior to executing the system
> +call. If there is no tracer present,
> +.BR ENOSYS
> +is returned to userland and the system call is not executed.
> +
> +A tracer will be notified if it requests
> +.BR PTRACE_O_TRACESECCOMP
> +using
> +.IR ptrace(PTRACE_SETOPTIONS) .
> +The tracer will be notified of a
> +.BR PTRACE_EVENT_SECCOMP
> +and the
> +.BR SECCOMP_RET_DATA
> +portion of the BPF program return value will be available to the tracer
> +via
> +.BR PTRACE_GETEVENTMSG .
> +
> +The tracer can skip the system call by changing the syscall number
> +to \-1. Alternatively, the tracer can change the system call
> +requested by changing the system call to a valid syscall number. If
> +the tracer asks to skip the system call, then the system call will
> +appear to return the value that the tracer puts in the return value
> +register.
> +
> +The seccomp check will not be run again after the tracer is
> +notified. (This means that seccomp-based sandboxes MUST NOT
> +allow use of ptrace, even of other sandboxed processes, without
> +extreme care; ptracers can use this mechanism to escape.)
> +.TP
> +.BR SECCOMP_RET_ALLOW
> +Results in the system call being executed.
> +
> +If multiple filters exist, the return value for the evaluation of a
> +given system call will always use the highest precedent value.
> +
> +Precedence is only determined using the
> +.BR SECCOMP_RET_ACTION
> +mask. When multiple filters return values of the same precedence,
> +only the
> +.BR SECCOMP_RET_DATA
> +from the most recently installed filter will be returned.
> +.SH RETURN VALUE
> +On success,
> +.BR seccomp ()
> +returns 0.
> +On error, if
> +.BR SECCOMP_FILTER_FLAG_TSYNC
> +was used, the return value is the thread ID that caused the
> +synchronization failure. On other errors, \-1 is returned, and
> +.IR errno
> +is set to indicate the cause of the error.
> +.SH ERRORS
> +.BR seccomp ()
> +can fail for the following reasons:
> +.TP
> +.BR EACCES
> +the caller did not have the
> +.BR CAP_SYS_ADMIN
> +capability, or had not set
> +.IR no_new_privs
> +before using
> +.BR SECCOMP_SET_MODE_FILTER .
> +.TP
> +.BR EFAULT
> +.IR args
> +was required to be a valid address.
> +.TP
> +.BR EINVAL
> +.IR operation
> +is unknown; or
> +.IR flags
> +are invalid for the given
> +.IR operation
> +.TP
> +.BR ESRCH
> +Another thread caused a failure during thread sync, but its ID could not
> +be determined.
> +.SH VERSIONS
> +This system call first appeared in Linux 3.16.
> +.\" FIXME Add glibc version
> +.SH CONFORMING TO
> +This system call is a nonstandard Linux extension.
> +.SH NOTES
> +.BR seccomp ()
> +provides a superset of the functionality provided by
> +.IR PR_SET_SECCOMP
> +of
> +.BR prctl (2) .
> +(Which does not support
> +.IR flags .)
> +.SH EXAMPLE
> +.nf
> +#include <errno.h>
> +#include <stddef.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <linux/audit.h>
> +#include <linux/filter.h>
> +#include <linux/seccomp.h>
> +#include <sys/prctl.h>
> +
> +static int install_filter(int syscall, int arch, int error)
> +{
> +    struct sock_filter filter[] = {
> +        /* Load architecture. */
> +        BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                 (offsetof(struct seccomp_data, arch))),
> +        /* Jump forward 4 instructions on architecture mismatch. */
> +        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, arch, 0, 4),
> +        /* Load syscall number. */
> +        BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                 (offsetof(struct seccomp_data, nr))),
> +        /* Jump forward 1 instruction on syscall mismatch. */
> +        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, syscall, 0, 1),
> +        /* Matching arch and syscall: return specific errno. */
> +        BPF_STMT(BPF_RET+BPF_K,
> +                 SECCOMP_RET_ERRNO|(error & SECCOMP_RET_DATA)),
> +        /* Destination of syscall mismatch: Allow other syscalls. */
> +        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +        /* Destination of arch mismatch: Kill process. */
> +        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
> +    };
> +    struct sock_fprog prog = {
> +        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
> +        .filter = filter,
> +    };
> +    if (seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog)) {
> +        perror("seccomp");
> +        return EXIT_FAILURE;
> +    }
> +    return EXIT_SUCCESS;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +    if (argc < 5) {
> +        fprintf(stderr, "Usage:\\n"
> +                "refuse []\\n"
> +                "Hint: AUDIT_ARCH_I386: 0x%X\\n"
> +                " AUDIT_ARCH_X86_64: 0x%X\\n"
> +                "\\n", AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
> +        return EXIT_FAILURE;
> +    }
> +    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
> +        perror("prctl");
> +        return EXIT_FAILURE;
> +    }
> +    if (install_filter(strtol(argv[1], NULL, 0),
> +                       strtol(argv[2], NULL, 0),
> +                       strtol(argv[3], NULL, 0)))
> +        return EXIT_FAILURE;
> +    execv(argv[4], &argv[4]);
> +    perror("execv");
> +    return EXIT_FAILURE;
> +}
> +.fi
> +.SH SEE ALSO
> +.ad l
> +.nh
> +.BR prctl (2),
> +.BR ptrace (2),
> +.BR signal (7),
> +.BR socket (7)
> +.ad
>

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/