Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753954AbcCIU7O (ORCPT ); Wed, 9 Mar 2016 15:59:14 -0500 Received: from mail-oi0-f50.google.com ([209.85.218.50]:35904 "EHLO mail-oi0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751999AbcCIU7M (ORCPT ); Wed, 9 Mar 2016 15:59:12 -0500 MIME-Version: 1.0 In-Reply-To: <56DF38BA.9030007@mellanox.com> References: <1456949376-4910-1-git-send-email-cmetcalf@ezchip.com> <1456949376-4910-10-git-send-email-cmetcalf@ezchip.com> <56D895EA.1060301@mellanox.com> <56DDE9C9.5060900@mellanox.com> <56DF38BA.9030007@mellanox.com> From: Andy Lutomirski Date: Wed, 9 Mar 2016 12:58:52 -0800 Message-ID: Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality To: Chris Metcalf , Kees Cook Cc: Thomas Gleixner , Christoph Lameter , Andrew Morton , Viresh Kumar , Ingo Molnar , Steven Rostedt , Tejun Heo , Gilad Ben Yossef , Will Deacon , Rik van Riel , Frederic Weisbecker , "Paul E. McKenney" , "linux-kernel@vger.kernel.org" , X86 ML , "H. Peter Anvin" , Catalin Marinas , Peter Zijlstra Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12440 Lines: 319 On Tue, Mar 8, 2016 at 12:40 PM, Chris Metcalf wrote: > On 03/07/2016 03:55 PM, Andy Lutomirski wrote: >>>> >>>> Let task isolation users who want to detect when they screw up and do >>>> >>a syscall do it with seccomp. >>> >>> >>> >Can you give me more details on what you're imagining here? Remember >>> >that a key use case is that these applications can remove the syscall >>> >prohibition voluntarily; it's only there to prevent unintended uses >>> >(by third party libraries or just straight-up programming bugs). >>> >As far as I can tell, seccomp does not allow you to go from "less >>> >permissive" to "more permissive" settings at all, which means that as >>> >it exists, it's not a good solution for this use case. >>> > >>> >Or were you thinking about a new seccomp API that allows this? >> >> I was. This is at least the second time I've wanted a way to ask >> seccomp to allow a layer to be removed. > > > Andy, > > Please take a look at this draft patch that intends to enable seccomp > as something that task isolation can use. Kees, this sounds like it may solve your self-instrumentation problem. Want to take a look? --Andy > > The basic idea is to add a notion of "removable" seccomp filters. > You can tag a filter that way when installing it (using the new > SECCOMP_FILTER_FLAG_REMOVABLE flag bit for SECCOMP_SET_MODE_FILTER), > and if the most recently-added filter is marked as removable, you can > remove it with the new SECCOMP_POP_FILTER operation. It is currently > implemented to be incompatible with SECCOMP_FILTER_FLAG_TSYNC, which > is plausible since the obvious use is for thread-local push and pop, > but the API allows for future implementation by including a flag word > with the pop_filter operation (now always zero). > > I did not make this supported via the prctl() since the "removable" > flag requires seccomp(), so making pop work with prctl() seemed silly. > > One interesting result of this is that now it is no longer true > that once current->seccomp.mode becomes non-zero, it may not be > changed, since it can now be changed back to DISABLED when you push a > removable filter and later pop it. > > My preference would be not to have to require all task-isolation users > to also figure out all the complexities of creating BPF programs, so > my intention is to have task isolation automatically generate a BPF > program (just allowing prctl/exit/exit_group and failing everything > else with SIGSYS). To support having it work this way, I open up > the seccomp stuff a little so that kernel clients can effectively > push/pop a BPF program into seccomp: That sounds like a great use case for the new libtaskisolation that someone is surely writing :) > > long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp) > long seccomp_pop_filter(unsigned int flags); > > We mark filters from this API with a new "extern_prog" boolean in the > seccomp_filter struct so the BPF program isn't freed when the > seccomp_filter itself is freed. Note that doing it this way avoids > having to go through the substantial overhead of creating a brand-new > BPF filter every time we enter task isolation mode. > > Not shown here is the additional code needed in task isolation to > create a suitable BPF program and then push and pop it as we go in and > out of task isolation mode. > > For what it's worth, I'm a little dubious about the tradeoff of adding > a substantial chunk of code to seccomp to handle what the v10 task > isolation code did with a single extra TIF flag test and a dozen lines > of code that got called. But given that you said there were other > potential users for the "filter pop" idea, it may indeed make sense. > > This is still all untested, but I wanted to get your sense of whether > this was even going in the right direction before spending more time > on it. > > Thanks! > > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > index 2296e6b2f690..feeba7a23d20 100644 > --- a/include/linux/seccomp.h > +++ b/include/linux/seccomp.h > @@ -3,13 +3,15 @@ > #include > -#define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC) > +#define SECCOMP_FILTER_FLAG_MASK \ > + (SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_REMOVABLE) > #ifdef CONFIG_SECCOMP > #include > #include > +struct bpf_prog; > struct seccomp_filter; > /** > * struct seccomp - the state of a seccomp'ed process > @@ -41,6 +43,8 @@ static inline int secure_computing(void) > extern u32 seccomp_phase1(struct seccomp_data *sd); > int seccomp_phase2(u32 phase1_result); > +long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp); > +long seccomp_pop_filter(unsigned int flags); > #else > extern void secure_computing_strict(int this_syscall); > #endif > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h > index 0f238a43ff1e..6e65ac2a7262 100644 > --- a/include/uapi/linux/seccomp.h > +++ b/include/uapi/linux/seccomp.h > @@ -13,9 +13,11 @@ > /* Valid operations for seccomp syscall. */ > #define SECCOMP_SET_MODE_STRICT 0 > #define SECCOMP_SET_MODE_FILTER 1 > +#define SECCOMP_POP_FILTER 2 > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > #define SECCOMP_FILTER_FLAG_TSYNC 1 > +#define SECCOMP_FILTER_FLAG_REMOVABLE 2 > /* > * All BPF programs must return a 32-bit value. > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 15a1795bbba1..c22eb3a56556 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -41,8 +41,9 @@ > * outside of a lifetime-guarded section. In general, this > * is only needed for handling filters shared across tasks. > * @prev: points to a previously installed, or inherited, filter > - * @len: the number of instructions in the program > - * @insnsi: the BPF program instructions to evaluate > + * @prog: the BPF program to evaluate > + * @removable: if this filter is removable with seccomp_pop_filter() > + * @extern_prog: if @prog should not be freed in seccomp_free_filter() > * > * seccomp_filter objects are organized in a tree linked via the @prev > * pointer. For any task, it appears to be a singly-linked list starting > @@ -58,6 +59,8 @@ struct seccomp_filter { > atomic_t usage; > struct seccomp_filter *prev; > struct bpf_prog *prog; > + bool removable; > + bool extern_prog; > }; > /* Limit any path through the tree to 256KB worth of instructions. */ > @@ -470,7 +473,8 @@ void get_seccomp_filter(struct task_struct *tsk) > static inline void seccomp_filter_free(struct seccomp_filter *filter) > { > if (filter) { > - bpf_prog_destroy(filter->prog); > + if (!filter->extern_prog) > + bpf_prog_destroy(filter->prog); > kfree(filter); > } > } > @@ -722,6 +726,7 @@ long prctl_get_seccomp(void) > * seccomp_set_mode_strict: internal function for setting strict seccomp > * > * Once current->seccomp.mode is non-zero, it may not be changed. > + * (other than to reset to DISABLED after removing the last removable > filter). > * > * Returns 0 on success or -EINVAL on failure. > */ > @@ -749,33 +754,34 @@ out: > #ifdef CONFIG_SECCOMP_FILTER > /** > - * seccomp_set_mode_filter: internal function for setting seccomp filter > + * do_push_filter: internal function for setting seccomp filter > * @flags: flags to change filter behavior > - * @filter: struct sock_fprog containing filter > + * @prepared: struct seccomp_filter to install > * > * This function may be called repeatedly to install additional filters. > * Every filter successfully installed will be evaluated (in reverse order) > * for each system call the task makes. > * > - * Once current->seccomp.mode is non-zero, it may not be changed. > + * Once current->seccomp.mode is non-zero, it may not be changed > + * (other than to reset to DISABLED after removing the last removable > filter). > * > * Returns 0 on success or -EINVAL on failure. > */ > -static long seccomp_set_mode_filter(unsigned int flags, > - const char __user *filter) > +long do_push_filter(unsigned int flags, struct seccomp_filter *prepared) > { > const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; > - struct seccomp_filter *prepared = NULL; > long ret = -EINVAL; > /* Validate flags. */ > if (flags & ~SECCOMP_FILTER_FLAG_MASK) > return -EINVAL; > - /* Prepare the new filter before holding any locks. */ > - prepared = seccomp_prepare_user_filter(filter); > - if (IS_ERR(prepared)) > - return PTR_ERR(prepared); > + if (flags & SECCOMP_FILTER_FLAG_REMOVABLE) { > + /* The intended use case is for thread-local push/pop. */ > + if (flags & SECCOMP_FILTER_FLAG_TSYNC) > + goto out_free; > + prepared->removable = true; > + } > /* > * Make sure we cannot change seccomp or nnp state via TSYNC > @@ -805,12 +811,87 @@ out_free: > seccomp_filter_free(prepared); > return ret; > } > + > +static long seccomp_set_mode_filter(unsigned int flags, > + const char __user *filter) > +{ > + struct seccomp_filter *prepared; > + > + /* Prepare the new filter before holding any locks. */ > + prepared = seccomp_prepare_user_filter(filter); > + if (IS_ERR(prepared)) > + return PTR_ERR(prepared); > + return seccomp_push_filter(flags, prepared); > +} > + > +long seccomp_push_filter(unsigned int flags, struct bpf_prog *fp) > +{ > + struct seccomp_filter *sfilter; > + > + sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL); > + if (!sfilter) > + return ERR_PTR(-ENOMEM); > + > + sfilter->prog = fp; > + sfilter->extern_prog = true; > + atomic_set(&sfilter->usage, 1); > + > + return do_push_filter(flags, sfilter); > +} > + > +/** > + * seccomp_pop_filter: internal function for removing filter > + * @flags: flags to change pop behavior > + * > + * This function removes the most recently installed filter, if it was > + * installed with the SECCOMP_FILTER_FLAG_REMOVABLE flag. Any previously > + * installed filters are left intact. > + * > + * If the last filter is removed, the seccomp state reverts to DISABLED. > + * > + * Returns 0 on success or -EINVAL on failure. > + */ > +long seccomp_pop_filter(unsigned int flags) > +{ > + struct seccomp_filter *filter; > + > + /* The intended use case is for temporary thread-local push/pop. */ > + if (flags & SECCOMP_FILTER_FLAG_TSYNC) > + return -EINVAL; > + > + spin_lock_irq(¤t->sighand->siglock); > + > + if (current->seccomp.mode != SECCOMP_MODE_FILTER) > + goto out; > + > + filter = current->seccomp.filter; > + if (unlikely(WARN_ON(filter == NULL)) || !filter->removable) > + goto out; > + > + if (filter->prev == NULL) { > + clear_tsk_thread_flag(current, TIF_SECCOMP); > + current->seccomp.mode = SECCOMP_MODE_DISABLED; > + } > + > + current->seccomp.filter = filter->prev; > + > + spin_unlock_irq(¤t->sighand->siglock); > + seccomp_filter_free(filter); > + return 0; > +out: > + spin_unlock_irq(¤t->sighand->siglock); > + return -EINVAL; > +} > #else > static inline long seccomp_set_mode_filter(unsigned int flags, > const char __user *filter) > { > return -EINVAL; > } > +static inline long seccomp_pop_filter(unsigned int flags) > +{ > + return -EINVAL; > +} > #endif > /* Common entry point for both prctl and syscall. */ > @@ -824,6 +905,8 @@ static long do_seccomp(unsigned int op, unsigned int > flags, > return seccomp_set_mode_strict(); > case SECCOMP_SET_MODE_FILTER: > return seccomp_set_mode_filter(flags, uargs); > + case SECCOMP_POP_FILTER: > + return seccomp_pop_filter(flags); > default: > return -EINVAL; > > } > > -- > Chris Metcalf, Mellanox Technologies > http://www.mellanox.com > -- Andy Lutomirski AMA Capital Management, LLC