Date: Tue, 21 Feb 2012 19:41:49 -0800
Subject: Re: [PATCH v10 11/11] Documentation: prctl/seccomp_filter
From: Will Drewry
To: Kees Cook
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
    linux-doc@vger.kernel.org, kernel-hardening@lists.openwall.com,
    netdev@vger.kernel.org, x86@kernel.org, arnd@arndb.de,
    davem@davemloft.net, hpa@zytor.com, mingo@redhat.com, oleg@redhat.com,
    peterz@infradead.org, rdunlap@xenotime.net, mcgrathr@chromium.org,
    tglx@linutronix.de, luto@mit.edu, eparis@redhat.com,
    serge.hallyn@canonical.com, djm@mindrot.org, scarybeasts@gmail.com,
    indan@nul.nu, pmoore@redhat.com, akpm@linux-foundation.org,
    corbet@lwn.net, eric.dumazet@gmail.com, markus@chromium.org
In-Reply-To: <20120221231230.GK3990@outflux.net>
References: <1329845435-2313-1-git-send-email-wad@chromium.org>
    <1329845435-2313-11-git-send-email-wad@chromium.org>
    <20120221231230.GK3990@outflux.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Feb 21, 2012 at 3:12 PM, Kees Cook wrote:
> Hi,
>
> I've collected the initial no-new-privs patches, and this whole series
> and pushed it here so I could more easily review it:
> http://git.kernel.org/?p=linux/kernel/git/kees/linux.git;a=shortlog;h=refs/heads/seccomp
>
> Some minor tweaks below...
>
> On Tue, Feb 21, 2012 at 11:30:35AM -0600, Will Drewry wrote:
>> Documents how system call filtering using Berkeley Packet
>> Filter programs works and how it may be used.
>> Includes an example for x86 (32-bit) and a semi-generic
>> example using a macro-based code generator.
>>
>> v10: - update for SIGSYS
>>      - update for new seccomp_data layout
>>      - update for ptrace option use
>> v9: - updated bpf-direct.c for SIGILL
>> v8: - add PR_SET_NO_NEW_PRIVS to the samples.
>> v7: - updated for all the new stuff in v7: TRAP, TRACE
>>     - only talk about PR_SET_SECCOMP now
>>     - fixed bad JLE32 check (coreyb@linux.vnet.ibm.com)
>>     - adds dropper.c: a simple system call disabler
>> v6: - tweak the language to note the requirement of
>>       PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
>> v5: - update sample to use system call arguments
>>     - adds a "fancy" example using a macro-based generator
>>     - cleaned up bpf in the sample
>>     - update docs to mention arguments
>>     - fix prctl value (eparis@redhat.com)
>>     - language cleanup (rdunlap@xenotime.net)
>> v4: - update for no_new_privs use
>>     - minor tweaks
>> v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
>>     - document use of tentative always-unprivileged
>>     - guard sample compilation for i386 and x86_64
>> v2: - move code to samples (corbet@lwn.net)
>>
>> Signed-off-by: Will Drewry
>> ---
>>  Documentation/prctl/seccomp_filter.txt |  157 +++++++++++++++++++++
>>  samples/Makefile                       |    2 +-
>>  samples/seccomp/Makefile               |   31 ++++
>>  samples/seccomp/bpf-direct.c           |  150 ++++++++++++++++++++
>>  samples/seccomp/bpf-fancy.c            |  102 ++++++++++++++
>>  samples/seccomp/bpf-helper.c           |   89 ++++++++++++
>>  samples/seccomp/bpf-helper.h           |  236 ++++++++++++++++++++++++++++++++
>>  samples/seccomp/dropper.c              |   68 +++++++++
>>  8 files changed, 834 insertions(+), 1 deletions(-)
>>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>>  create mode 100644 samples/seccomp/Makefile
>>  create mode 100644 samples/seccomp/bpf-direct.c
>>  create mode 100644 samples/seccomp/bpf-fancy.c
>>  create mode 100644 samples/seccomp/bpf-helper.c
>>  create mode 100644 samples/seccomp/bpf-helper.h
>>  create mode 100644 samples/seccomp/dropper.c
>>
>> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
>> new file mode 100644
>> index 0000000..7de865b
>> --- /dev/null
>> +++ b/Documentation/prctl/seccomp_filter.txt
>> @@ -0,0 +1,157 @@
>> +             SECure COMPuting with filters
>> +             =============================
>> +
>> +Introduction
>> +------------
>> +
>> +A large number of system calls are exposed to every userland process
>> +with many of them going unused for the entire lifetime of the process.
>> +As system calls change and mature, bugs are found and eradicated.  A
>> +certain subset of userland applications benefit by having a reduced set
>> +of available system calls.  The resulting set reduces the total kernel
>> +surface exposed to the application.  System call filtering is meant for
>> +use with those applications.
>> +
>> +Seccomp filtering provides a means for a process to specify a filter for
>> +incoming system calls.  The filter is expressed as a Berkeley Packet
>> +Filter (BPF) program, as with socket filters, except that the data
>> +operated on is related to the system call being made: system call
>> +number and the system call arguments.  This allows for expressive
>> +filtering of system calls using a filter program language with a long
>> +history of being exposed to userland and a straightforward data set.
>> +
>> +Additionally, BPF makes it impossible for users of seccomp to fall prey
>> +to time-of-check-time-of-use (TOCTOU) attacks that are common in system
>> +call interposition frameworks.  BPF programs may not dereference
>> +pointers which constrains all filters to solely evaluating the system
>> +call arguments directly.
>> +
>> +What it isn't
>> +-------------
>> +
>> +System call filtering isn't a sandbox.  It provides a clearly defined
>> +mechanism for minimizing the exposed kernel surface.  It is meant to be
>> +a tool for sandbox developers to use.  Beyond that, policy for logical
>> +behavior and information flow should be managed with a combination of
>> +other system hardening techniques and, potentially, an LSM of your
>> +choosing.  Expressive, dynamic filters provide further options down this
>> +path (avoiding pathological sizes or selecting which of the multiplexed
>> +system calls in socketcall() is allowed, for instance) which could be
>> +construed, incorrectly, as a more complete sandboxing solution.
>> +
>> +Usage
>> +-----
>> +
>> +An additional seccomp mode is added and is enabled using the same
>> +prctl(2) call as the strict seccomp.  If the architecture has
>> +CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
>> +
>> +PR_SET_SECCOMP:
>> +     Now takes an additional argument which specifies a new filter
>> +     using a BPF program.
>> +     The BPF program will be executed over struct seccomp_data
>> +     reflecting the system call number, arguments, and other
>> +     metadata.  The BPF program must then return one of the
>> +     acceptable values to inform the kernel which action should be
>> +     taken.
>> +
>> +     Usage:
>> +             prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
>> +
>> +     The 'prog' argument is a pointer to a struct sock_fprog which
>> +     will contain the filter program.  If the program is invalid, the
>> +     call will return -1 and set errno to EINVAL.
>> +
>> +     Note, is_compat_task is also tracked for the @prog.  This means
>> +     that once set the calling task will have all of its system calls
>> +     blocked if it switches its system call ABI.
>> +
>> +     If fork/clone and execve are allowed by @prog, any child
>> +     processes will be constrained to the same filters and system
>> +     call ABI as the parent.
>> +
>> +     Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
>> +     run with CAP_SYS_ADMIN privileges in its namespace.  If these are not
>> +     true, -EACCES will be returned.  This requirement ensures that filter
>> +     programs cannot be applied to child processes with greater privileges
>> +     than the task that installed them.
>> +
>> +     Additionally, if prctl(2) is allowed by the attached filter,
>> +     additional filters may be layered on which will increase evaluation
>> +     time, but allow for further decreasing the attack surface during
>> +     execution of a process.
>> +
>> +The above call returns 0 on success and non-zero on error.
>> +
>> +Return values
>> +-------------
>> +
>> +A seccomp filter may return any of the following values:
>> +     SECCOMP_RET_ALLOW, SECCOMP_RET_KILL, SECCOMP_RET_TRAP,
>> +     SECCOMP_RET_ERRNO, or SECCOMP_RET_TRACE.
>> +
>> +SECCOMP_RET_ALLOW:
>> +     If all filters for a given task return this value then
>> +     the system call will proceed normally.
>> +
>> +SECCOMP_RET_KILL:
>> +     If any filters for a given take return this value then
>> +     the task will exit immediately without executing the system
>> +     call.
>> +
>> +SECCOMP_RET_TRAP:
>> +     If any filters specify SECCOMP_RET_TRAP and none of them
>> +     specify SECCOMP_RET_KILL, then the kernel will send a SIGTRAP
>> +     signal to the task and not execute the system call.  The kernel
>> +     will rollback the register state to just before system call
>> +     entry such that a signal handler in the process will be able
>> +     to inspect the ucontext_t->uc_mcontext registers and emulate
>> +     system call success or failure upon return from the signal
>> +     handler.
>> +
>> +     The SIGTRAP is differentiated by other SIGTRAPS by a si_code
>> +     of TRAP_SECCOMP.
>
> This should reflect the SIGTRAP->SIGSYS change (and SYS_SECCOMP si_code
> change).

Oops - yup.

>> +
>> +SECCOMP_RET_ERRNO:
>> +     If returned, the value provided in the lower 16-bits is
>> +     returned to userland as the errno and the system call is
>> +     not executed.
>
> The other sections each say "If any" or "If all" to clarify their
> behavior with multiple filters. The same should be done here, but more
> comments below. Additionally, it should clarify that on multiple
> uses of RET_ERRNO, the lower of the errnos will be returned.

I might drop all of the written-out precedence verbiage, since I think
your layout is more intuitive without it.

>> +
>> +SECCOMP_RET_TRACE:
>> +     If any filters return this value and the others return
>> +     SECCOMP_RET_ALLOW, then the kernel will attempt to notify
>> +     a ptrace()-based tracer prior to executing the system call.
>> +
>> +     A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
>> +     via PTRACE_SETOPTIONS.  Otherwise, the system call will
>> +     not execute and -ENOSYS will be returned to userspace.
>> +
>> +     If the tracer ignores notification, then the system call will
>> +     proceed normally.  Changes to the registers will function
>> +     similarly to PTRACE_SYSCALL.  Additionally, if the tracer
>> +     detaches during notification or just after, the task may be
>> +     terminated as precautionary measure.
>> +
>> +Please note that the order of precedence is as follows:
>> +SECCOMP_RET_KILL, SECCOMP_RET_ERRNO, SECCOMP_RET_TRAP,
>> +SECCOMP_RET_TRACE, SECCOMP_RET_ALLOW.
>> +
>> +If multiple filters exist, the return value for the evaluation of a given
>> +system call will always use the highest precedent value.
>> +SECCOMP_RET_KILL will always take precedence.
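
For anyone following along, the precedence rule in the text quoted above
could be modelled roughly as follows. This is only a sketch under stated
assumptions: the enum, its rank values, struct result and
resolve_actions() are hypothetical names made up for illustration, and
none of this is the kernel's implementation or the real SECCOMP_RET_*
encoding. It assumes the order listed above (KILL, ERRNO, TRAP, TRACE,
ALLOW) plus the "lowest errno wins" behavior suggested for multiple
RET_ERRNO results.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical ranks: a smaller number means higher precedence.
 * These are NOT the kernel's SECCOMP_RET_* values; they only mirror
 * the order listed in the quoted documentation.
 */
enum action {
	ACT_KILL  = 0,
	ACT_ERRNO = 1,
	ACT_TRAP  = 2,
	ACT_TRACE = 3,
	ACT_ALLOW = 4
};

/* One filter's verdict (illustrative type, not a kernel structure). */
struct result {
	enum action action;
	unsigned short data;	/* errno for ACT_ERRNO, otherwise unused */
};

/* Combine the verdicts of n filters (n >= 1) into the one that applies. */
static struct result resolve_actions(const struct result *r, size_t n)
{
	struct result best;
	size_t i;

	assert(r != NULL && n > 0);
	best = r[0];
	for (i = 1; i < n; i++) {
		if (r[i].action < best.action)
			best = r[i];		/* higher precedence wins */
		else if (r[i].action == ACT_ERRNO &&
			 best.action == ACT_ERRNO &&
			 r[i].data < best.data)
			best.data = r[i].data;	/* lowest errno wins */
	}
	return best;
}
```

With verdicts of ALLOW, ERRNO 13, TRACE and ERRNO 1, this sketch resolves
to ERRNO with an errno of 1; adding any KILL verdict makes KILL win.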
>
> I think this clarification about precedence is good but should be at the
> head of the "Return values" section, and the sections ordered from that
> perspective, so that the "highest precedent value" aspect is a little
> bit easier to follow:
>
>
> Return values
> -------------
> A seccomp filter may return any of the following values. If multiple
> filters exist, the return value for the evaluation of a given system
> call will always use the highest precedent value. (For example,
> SECCOMP_RET_KILL will always take precedence.)
>
> In precedence order, they are:
>
> SECCOMP_RET_KILL:
>        If any filters for a given take return this value then
>        the task will exit immediately without executing the system
>        call.
>
> SECCOMP_RET_TRAP:
>        If any filters specify SECCOMP_RET_TRAP and none of them
>        specify SECCOMP_RET_KILL, then the kernel will send a SIGSYS
>        signal to the task and not execute the system call. The kernel
>        will rollback the register state to just before system call
>        entry such that a signal handler in the process will be able
>        to inspect the ucontext_t->uc_mcontext registers and emulate
>        system call success or failure upon return from the signal
>        handler.
>
>        The SIGSYS is differentiated by other SIGSYS signals by a si_code
>        of SYS_SECCOMP.
>
> SECCOMP_RET_ERRNO:
>        If any filters return this value and none of them specify a
>        higher precedence value, then the lowest of the values provided
>        in the lower 16-bits is returned to userland as the errno and
>        the system call is not executed.
>
> SECCOMP_RET_TRACE:
>        If any filters return this value and none of them specify a
>        higher precedence value, then the kernel will attempt to notify
>        a ptrace()-based tracer prior to executing the system call.
>
>        A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
>        via PTRACE_SETOPTIONS. Otherwise, the system call will
>        not execute and -ENOSYS will be returned to userspace.
>        If the tracer ignores notification, then the system call will
>        proceed normally. Changes to the registers will function
>        similarly to PTRACE_SYSCALL. Additionally, if the tracer
>        detaches during notification or just after, the task may be
>        terminated as precautionary measure.
>
> SECCOMP_RET_ALLOW:
>        If all filters for a given task return this value then
>        the system call will proceed normally.
>

Thanks! I'll integrate all of this and post a full v11 series in the
morning (depending on any feedback that trickles in later :).

cheers,
will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/