References: <1456949376-4910-1-git-send-email-cmetcalf@ezchip.com>
 <1456949376-4910-10-git-send-email-cmetcalf@ezchip.com>
 <56D895EA.1060301@mellanox.com> <56DDE9C9.5060900@mellanox.com>
 <56DF38BA.9030007@mellanox.com>
From: Andy Lutomirski
Date: Wed, 9 Mar 2016 13:57:41 -0800
Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality
To: Kees Cook, Kenton Varda
Cc: Thomas Gleixner, Christoph Lameter, X86 ML, Andrew Morton,
 Viresh Kumar, Steven Rostedt, Ingo Molnar, Tejun Heo,
 Gilad Ben Yossef, Rik van Riel, Will Deacon, Frederic Weisbecker,
 "Paul E. McKenney", Chris Metcalf, "linux-kernel@vger.kernel.org",
 Catalin Marinas, "H. Peter Anvin", Peter Zijlstra

[adding Kenton -- you do interesting things with seccomp, too]

On Mar 9, 2016 1:25 PM, "Kees Cook" wrote:
>
> On Wed, Mar 9, 2016 at 1:18 PM, Andy Lutomirski wrote:
> > On Wed, Mar 9, 2016 at 1:10 PM, Kees Cook wrote:
> >> On Wed, Mar 9, 2016 at 12:58 PM, Andy Lutomirski wrote:
> >>> On Tue, Mar 8, 2016 at 12:40 PM, Chris Metcalf wrote:
> >>>> On 03/07/2016 03:55 PM, Andy Lutomirski wrote:
> >>>>>>>
> >>>>>>> Let task isolation users who want to detect when they screw up and do
> >>>>>>> >>a syscall do it with seccomp.
> >>>>>>
> >>>>>>
> >>>>>> >Can you give me more details on what you're imagining here?
> >>>>>> >Remember that a key use case is that these applications can remove
> >>>>>> >the syscall prohibition voluntarily; it's only there to prevent
> >>>>>> >unintended uses (by third party libraries or just straight-up
> >>>>>> >programming bugs).
> >>>>>> >As far as I can tell, seccomp does not allow you to go from "less
> >>>>>> >permissive" to "more permissive" settings at all, which means that as
> >>>>>> >it exists, it's not a good solution for this use case.
> >>>>>> >
> >>>>>> >Or were you thinking about a new seccomp API that allows this?
> >>>>>
> >>>>> I was.  This is at least the second time I've wanted a way to ask
> >>>>> seccomp to allow a layer to be removed.
> >>>>
> >>>>
> >>>> Andy,
> >>>>
> >>>> Please take a look at this draft patch that intends to enable seccomp
> >>>> as something that task isolation can use.
> >>>
> >>> Kees, this sounds like it may solve your self-instrumentation problem.
> >>> Want to take a look?
> >>
> >> Errrr... I'm pretty uncomfortable with this. I really would like to
> >> keep the basic semantics of seccomp as simple as possible: filtering
> >> only gets more restricted.
>
> The other problem is that this won't work if the third-party code
> actually uses seccomp itself... this isn't composable as-is.

It kind of is.  Set it up to trap to SIGSYS on any unexpected
seccomp() call.  Then emulate it.

To make this slightly cleaner, there could be a variant of the flag in
which a poppable seccomp filter is set to pop itself if it generates
SIGSYS.  Then you'd make sure to always trap seccomp and sigaction --
you'd have to emulate those two for composability.  Presumably, for
sanity, it would be illegal to have any filter at all stacked on top
of this type of filter.

We'd also probably want to prevent installation of a non-poppable
filter on top of a poppable filter -- that wouldn't make much sense.

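As a rough illustration of the "always trap seccomp and sigaction" part,
here is a minimal sketch that uses only the filter API that exists
today.  The poppable / self-popping flag discussed above is a proposal
and does not appear here; the names trap_filter and
install_trap_filter are made up for the example, and the architecture
check on seccomp_data.arch that a real filter needs is left out for
brevity.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/*
 * Classic-BPF program: seccomp() and rt_sigaction() raise SIGSYS,
 * every other syscall is allowed.  Assumption: a real filter would
 * first validate seccomp_data.arch; that check is omitted here.
 */
static struct sock_filter trap_filter[] = {
	/* [0] A = seccomp_data.nr */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* [1] nr == seccomp?  yes: skip 2 insns to [4], no: fall through */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_seccomp, 2, 0),
	/* [2] nr == rt_sigaction?  yes: skip 1 insn to [4], no: fall through */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_rt_sigaction, 1, 0),
	/* [3] anything else runs normally */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	/* [4] deliver SIGSYS so a handler can emulate the call */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
};

/* Hypothetical helper: install the filter on the calling thread. */
static int install_trap_filter(void)
{
	struct sock_fprog prog = {
		.len = sizeof(trap_filter) / sizeof(trap_filter[0]),
		.filter = trap_filter,
	};

	/* Needed so an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A SIGSYS handler installed before install_trap_filter() would then see
the trapped call number in si_syscall and could emulate it.  With
today's kernels the filter stays in place permanently, which is
exactly the limitation being discussed.
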
Just to muddy the waters, there's another possible use case for this:
a sandbox program could mprotect all its critical data structures
readonly or even PROT_NONE, set up sigaltstack and a very carefully
written SIGSYS handler, install a self-popping seccomp filter that
turns *everything* into SIGSYS, and then jump to untrusted code.  Now
we can finally have a trusted in-process supervisor that can't be
tampered with because it's only privileged if it's entered through a
special gate (i.e. SIGSYS).

>
> >>
> >> This doesn't really solve my self-instrumentation desires since I
> >> still can't sanely deliver signals. I would need a lot more
> >> convincing. :)
> >>
> >
> > I think you could do it by adding a filter that turns all the unknown
> > things into SIGSYS, allows sigreturn, and allows the seccomp syscall,
> > at least in the pop-off-the-filter variant.  Then you add this
> > removably.
> >
> > In the SIGSYS handler, you pop off the filter, do your bookkeeping,
> > update the filter, and push it back on.
>
> No, this won't let the original syscall through. I wanted to be able
> to document the syscalls as they happened without needing audit or a
> ptrace monitor. I am currently convinced that my desire for this is no
> good, and it should just be done with a ptrace monitor...

It can let the original through -- just do the arch-specific restart
incantation before you return.  On x86, reload EAX/RAX with the
syscall number and backtrack IP by 2.  As of Linux 4.4, the code for
this is sane and has a good test case.  On older 64-bit kernels on AMD
running 32-bit code, there could be odd side effects.

If I ever factor out my SIGSYS decoder, I'll add a restart helper.

--Andy
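
The x86 restart step described above ("reload EAX/RAX and backtrack IP
by 2") might look roughly like the sketch below.  This is not the
restart helper mentioned in the mail; rewind_syscall and
sigsys_restart are hypothetical names, and the wrapper assumes x86-64
with glibc's ucontext layout.  The argument registers are assumed to
be untouched since syscall entry.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <ucontext.h>

/* "syscall" (0f 05) and "int $0x80" (cd 80) are both two bytes long. */
enum { X86_SYSCALL_INSN_LEN = 2 };

/*
 * Hypothetical core of a restart helper: rewrite the saved registers
 * so that returning from the signal handler re-executes the syscall.
 */
static void rewind_syscall(long long *ax, long long *ip, int syscall_nr)
{
	*ax = syscall_nr;		/* reload EAX/RAX with the syscall number */
	*ip -= X86_SYSCALL_INSN_LEN;	/* point IP back at the syscall insn */
}

#if defined(__x86_64__) && defined(REG_RIP)
/*
 * SA_SIGINFO-style SIGSYS handler applying the rewind to the saved
 * context.  The retried syscall only gets through if the filter that
 * trapped it no longer matches; otherwise it simply traps again.
 */
static void sigsys_restart(int sig, siginfo_t *info, void *ucontext)
{
	ucontext_t *uc = ucontext;

	(void)sig;
	rewind_syscall(&uc->uc_mcontext.gregs[REG_RAX],
		       &uc->uc_mcontext.gregs[REG_RIP],
		       info->si_syscall);
}
#endif
```

sigsys_restart would be registered with sigaction() using SA_SIGINFO
before the filter goes on; any bookkeeping belongs before the rewind.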