Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934063AbcCIVK5 (ORCPT ); Wed, 9 Mar 2016 16:10:57 -0500 Received: from mail-ig0-f173.google.com ([209.85.213.173]:35669 "EHLO mail-ig0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753593AbcCIVKt (ORCPT ); Wed, 9 Mar 2016 16:10:49 -0500 MIME-Version: 1.0 In-Reply-To: References: <1456949376-4910-1-git-send-email-cmetcalf@ezchip.com> <1456949376-4910-10-git-send-email-cmetcalf@ezchip.com> <56D895EA.1060301@mellanox.com> <56DDE9C9.5060900@mellanox.com> <56DF38BA.9030007@mellanox.com> Date: Wed, 9 Mar 2016 13:10:48 -0800 X-Google-Sender-Auth: 6aG9jaYbJKiFdvnn4rVKw4zIE9U Message-ID: Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality From: Kees Cook To: Andy Lutomirski Cc: Chris Metcalf , Thomas Gleixner , Christoph Lameter , Andrew Morton , Viresh Kumar , Ingo Molnar , Steven Rostedt , Tejun Heo , Gilad Ben Yossef , Will Deacon , Rik van Riel , Frederic Weisbecker , "Paul E. McKenney" , "linux-kernel@vger.kernel.org" , X86 ML , "H. Peter Anvin" , Catalin Marinas , Peter Zijlstra Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 13266 Lines: 338 On Wed, Mar 9, 2016 at 12:58 PM, Andy Lutomirski wrote: > On Tue, Mar 8, 2016 at 12:40 PM, Chris Metcalf wrote: >> On 03/07/2016 03:55 PM, Andy Lutomirski wrote: >>>>> >>>>> Let task isolation users who want to detect when they screw up and do >>>>> >>a syscall do it with seccomp. >>>> >>>> >>>> >Can you give me more details on what you're imagining here? Remember >>>> >that a key use case is that these applications can remove the syscall >>>> >prohibition voluntarily; it's only there to prevent unintended uses >>>> >(by third party libraries or just straight-up programming bugs). >>>> >As far as I can tell, seccomp does not allow you to go from "less >>>> >permissive" to "more permissive" settings at all, which means that as >>>> >it exists, it's not a good solution for this use case. >>>> > >>>> >Or were you thinking about a new seccomp API that allows this? >>> >>> I was. This is at least the second time I've wanted a way to ask >>> seccomp to allow a layer to be removed. >> >> >> Andy, >> >> Please take a look at this draft patch that intends to enable seccomp >> as something that task isolation can use. > > Kees, this sounds like it may solve your self-instrumentation problem. > Want to take a look? Errrr... I'm pretty uncomfortable with this. I really would like to keep the basic semantics of seccomp is simple as possible: filtering only gets more restricted. This doesn't really solve my self-instrumentation desires since I still can't sanely deliver signals. I would need a lot more convincing. :) > >> >> The basic idea is to add a notion of "removable" seccomp filters. >> You can tag a filter that way when installing it (using the new >> SECCOMP_FILTER_FLAG_REMOVABLE flag bit for SECCOMP_SET_MODE_FILTER), >> and if the most recently-added filter is marked as removable, you can >> remove it with the new SECCOMP_POP_FILTER operation. It is currently >> implemented to be incompatible with SECCOMP_FILTER_FLAG_TSYNC, which >> is plausible since the obvious use is for thread-local push and pop, >> but the API allows for future implementation by including a flag word >> with the pop_filter operation (now always zero). >> >> I did not make this supported via the prctl() since the "removable" >> flag requires seccomp(), so making pop work with prctl() seemed silly. >> >> One interesting result of this is that now it is no longer true >> that once current->seccomp.mode becomes non-zero, it may not be >> changed, since it can now be changed back to DISABLED when you push a >> removable filter and later pop it. >> >> My preference would be not to have to require all task-isolation users >> to also figure out all the complexities of creating BPF programs, so >> my intention is to have task isolation automatically generate a BPF >> program (just allowing prctl/exit/exit_group and failing everything >> else with SIGSYS). To support having it work this way, I open up >> the seccomp stuff a little so that kernel clients can effectively >> push/pop a BPF program into seccomp: > > That sounds like a great use case for the new libtaskisolation that > someone is surely writing :) > >> >> long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp) >> long seccomp_pop_filter(unsigned int flags); >> >> We mark filters from this API with a new "extern_prog" boolean in the >> seccomp_filter struct so the BPF program isn't freed when the >> seccomp_filter itself is freed. Note that doing it this way avoids >> having to go through the substantial overhead of creating a brand-new >> BPF filter every time we enter task isolation mode. >> >> Not shown here is the additional code needed in task isolation to >> create a suitable BPF program and then push and pop it as we go in and >> out of task isolation mode. >> >> For what it's worth, I'm a little dubious about the tradeoff of adding >> a substantial chunk of code to seccomp to handle what the v10 task >> isolation code did with a single extra TIF flag test and a dozen lines >> of code that got called. But given that you said there were other >> potential users for the "filter pop" idea, it may indeed make sense. I think the extra TIF flag makes more sense than overloading seccomp. -Kees >> >> This is still all untested, but I wanted to get your sense of whether >> this was even going in the right direction before spending more time >> on it. >> >> Thanks! >> >> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h >> index 2296e6b2f690..feeba7a23d20 100644 >> --- a/include/linux/seccomp.h >> +++ b/include/linux/seccomp.h >> @@ -3,13 +3,15 @@ >> #include >> -#define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC) >> +#define SECCOMP_FILTER_FLAG_MASK \ >> + (SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_REMOVABLE) >> #ifdef CONFIG_SECCOMP >> #include >> #include >> +struct bpf_prog; >> struct seccomp_filter; >> /** >> * struct seccomp - the state of a seccomp'ed process >> @@ -41,6 +43,8 @@ static inline int secure_computing(void) >> extern u32 seccomp_phase1(struct seccomp_data *sd); >> int seccomp_phase2(u32 phase1_result); >> +long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp); >> +long seccomp_pop_filter(unsigned int flags); >> #else >> extern void secure_computing_strict(int this_syscall); >> #endif >> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h >> index 0f238a43ff1e..6e65ac2a7262 100644 >> --- a/include/uapi/linux/seccomp.h >> +++ b/include/uapi/linux/seccomp.h >> @@ -13,9 +13,11 @@ >> /* Valid operations for seccomp syscall. */ >> #define SECCOMP_SET_MODE_STRICT 0 >> #define SECCOMP_SET_MODE_FILTER 1 >> +#define SECCOMP_POP_FILTER 2 >> /* Valid flags for SECCOMP_SET_MODE_FILTER */ >> #define SECCOMP_FILTER_FLAG_TSYNC 1 >> +#define SECCOMP_FILTER_FLAG_REMOVABLE 2 >> /* >> * All BPF programs must return a 32-bit value. >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c >> index 15a1795bbba1..c22eb3a56556 100644 >> --- a/kernel/seccomp.c >> +++ b/kernel/seccomp.c >> @@ -41,8 +41,9 @@ >> * outside of a lifetime-guarded section. In general, this >> * is only needed for handling filters shared across tasks. >> * @prev: points to a previously installed, or inherited, filter >> - * @len: the number of instructions in the program >> - * @insnsi: the BPF program instructions to evaluate >> + * @prog: the BPF program to evaluate >> + * @removable: if this filter is removable with seccomp_pop_filter() >> + * @extern_prog: if @prog should not be freed in seccomp_free_filter() >> * >> * seccomp_filter objects are organized in a tree linked via the @prev >> * pointer. For any task, it appears to be a singly-linked list starting >> @@ -58,6 +59,8 @@ struct seccomp_filter { >> atomic_t usage; >> struct seccomp_filter *prev; >> struct bpf_prog *prog; >> + bool removable; >> + bool extern_prog; >> }; >> /* Limit any path through the tree to 256KB worth of instructions. */ >> @@ -470,7 +473,8 @@ void get_seccomp_filter(struct task_struct *tsk) >> static inline void seccomp_filter_free(struct seccomp_filter *filter) >> { >> if (filter) { >> - bpf_prog_destroy(filter->prog); >> + if (!filter->extern_prog) >> + bpf_prog_destroy(filter->prog); >> kfree(filter); >> } >> } >> @@ -722,6 +726,7 @@ long prctl_get_seccomp(void) >> * seccomp_set_mode_strict: internal function for setting strict seccomp >> * >> * Once current->seccomp.mode is non-zero, it may not be changed. >> + * (other than to reset to DISABLED after removing the last removable >> filter). >> * >> * Returns 0 on success or -EINVAL on failure. >> */ >> @@ -749,33 +754,34 @@ out: >> #ifdef CONFIG_SECCOMP_FILTER >> /** >> - * seccomp_set_mode_filter: internal function for setting seccomp filter >> + * do_push_filter: internal function for setting seccomp filter >> * @flags: flags to change filter behavior >> - * @filter: struct sock_fprog containing filter >> + * @prepared: struct seccomp_filter to install >> * >> * This function may be called repeatedly to install additional filters. >> * Every filter successfully installed will be evaluated (in reverse order) >> * for each system call the task makes. >> * >> - * Once current->seccomp.mode is non-zero, it may not be changed. >> + * Once current->seccomp.mode is non-zero, it may not be changed >> + * (other than to reset to DISABLED after removing the last removable >> filter). >> * >> * Returns 0 on success or -EINVAL on failure. >> */ >> -static long seccomp_set_mode_filter(unsigned int flags, >> - const char __user *filter) >> +long do_push_filter(unsigned int flags, struct seccomp_filter *prepared) >> { >> const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; >> - struct seccomp_filter *prepared = NULL; >> long ret = -EINVAL; >> /* Validate flags. */ >> if (flags & ~SECCOMP_FILTER_FLAG_MASK) >> return -EINVAL; >> - /* Prepare the new filter before holding any locks. */ >> - prepared = seccomp_prepare_user_filter(filter); >> - if (IS_ERR(prepared)) >> - return PTR_ERR(prepared); >> + if (flags & SECCOMP_FILTER_FLAG_REMOVABLE) { >> + /* The intended use case is for thread-local push/pop. */ >> + if (flags & SECCOMP_FILTER_FLAG_TSYNC) >> + goto out_free; >> + prepared->removable = true; >> + } >> /* >> * Make sure we cannot change seccomp or nnp state via TSYNC >> @@ -805,12 +811,87 @@ out_free: >> seccomp_filter_free(prepared); >> return ret; >> } >> + >> +static long seccomp_set_mode_filter(unsigned int flags, >> + const char __user *filter) >> +{ >> + struct seccomp_filter *prepared; >> + >> + /* Prepare the new filter before holding any locks. */ >> + prepared = seccomp_prepare_user_filter(filter); >> + if (IS_ERR(prepared)) >> + return PTR_ERR(prepared); >> + return seccomp_push_filter(flags, prepared); >> +} >> + >> +long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp) >> +{ >> + struct seccomp_filter *sfilter; >> + >> + sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL); >> + if (!sfilter) >> + return ERR_PTR(-ENOMEM); >> + >> + sfilter->prog = fp; >> + sfilter->extern_prog = true; >> + atomic_set(&sfilter->usage, 1); >> + >> + return do_push_filter(flags, sfilter); >> +} >> + >> +/** >> + * seccomp_pop_filter: internal function for removing filter >> + * @flags: flags to change pop behavior >> + * >> + * This function removes the most recently installed filter, if it was >> + * installed with the SECCOMP_FILTER_FLAG_REMOVABLE flag. Any previously >> + * installed filters are left intact. >> + * >> + * If the last filter is removed, the seccomp state reverts to DISABLED. >> + * >> + * Returns 0 on success or -EINVAL on failure. >> + */ >> +long seccomp_pop_filter(unsigned int flags) >> +{ >> + struct seccomp_filter *filter; >> + >> + /* The intended use case is for temporary thread-local push/pop. */ >> + if (flags & SECCOMP_FILTER_FLAG_TSYNC) >> + return -EINVAL; >> + >> + spin_lock_irq(¤t->sighand->siglock); >> + >> + if (current->seccomp.mode != SECCOMP_MODE_FILTER) >> + goto out; >> + >> + filter = current->seccomp.filter; >> + if (unlikely(WARN_ON(filter == NULL)) || !filter->removable) >> + goto out; >> + >> + if (filter->prev == NULL) { >> + clear_tsk_thread_flag(current, TIF_SECCOMP); >> + current->seccomp.mode = SECCOMP_MODE_DISABLED; >> + } >> + >> + current->seccomp.filter = filter->prev; >> + >> + spin_unlock_irq(¤t->sighand->siglock); >> + seccomp_filter_free(filter); >> + return 0; >> +out: >> + spin_unlock_irq(¤t->sighand->siglock); >> + return -EINVAL; >> +} >> #else >> static inline long seccomp_set_mode_filter(unsigned int flags, >> const char __user *filter) >> { >> return -EINVAL; >> } >> +static inline long seccomp_pop_filter(unsigned int flags) >> +{ >> + return -EINVAL; >> +} >> #endif >> /* Common entry point for both prctl and syscall. */ >> @@ -824,6 +905,8 @@ static long do_seccomp(unsigned int op, unsigned int >> flags, >> return seccomp_set_mode_strict(); >> case SECCOMP_SET_MODE_FILTER: >> return seccomp_set_mode_filter(flags, uargs); >> + case SECCOMP_POP_FILTER: >> + return seccomp_pop_filter(flags); >> default: >> return -EINVAL; >> >> } >> >> -- >> Chris Metcalf, Mellanox Technologies >> http://www.mellanox.com >> > > > > -- > Andy Lutomirski > AMA Capital Management, LLC -- Kees Cook Chrome OS & Brillo Security