Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932097AbcCGUzx (ORCPT ); Mon, 7 Mar 2016 15:55:53 -0500 Received: from mail-oi0-f54.google.com ([209.85.218.54]:34240 "EHLO mail-oi0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753260AbcCGUzo (ORCPT ); Mon, 7 Mar 2016 15:55:44 -0500 MIME-Version: 1.0 In-Reply-To: <56DDE9C9.5060900@mellanox.com> References: <1456949376-4910-1-git-send-email-cmetcalf@ezchip.com> <1456949376-4910-10-git-send-email-cmetcalf@ezchip.com> <56D895EA.1060301@mellanox.com> <56DDE9C9.5060900@mellanox.com> From: Andy Lutomirski Date: Mon, 7 Mar 2016 12:55:24 -0800 Message-ID: Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality To: Chris Metcalf Cc: Thomas Gleixner , Christoph Lameter , Andrew Morton , Viresh Kumar , Ingo Molnar , Steven Rostedt , Tejun Heo , Gilad Ben Yossef , Will Deacon , Rik van Riel , Frederic Weisbecker , "Paul E. McKenney" , "linux-kernel@vger.kernel.org" , X86 ML , "H. Peter Anvin" , Catalin Marinas , Peter Zijlstra Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4777 Lines: 104 On Mon, Mar 7, 2016 at 12:51 PM, Chris Metcalf wrote: > On 03/03/2016 06:46 PM, Andy Lutomirski wrote: >> >> On Thu, Mar 3, 2016 at 11:52 AM, Chris Metcalf >> wrote: >>> >>> On 03/02/2016 07:36 PM, Andy Lutomirski wrote: >>>> >>>> On Mar 2, 2016 12:10 PM, "Chris Metcalf" wrote: >>>>> >>>>> In prepare_exit_to_usermode(), call task_isolation_ready() >>>>> when we are checking the thread-info flags, and after we've handled >>>>> the other work, call task_isolation_enter() unconditionally. >>>>> >>>>> In syscall_trace_enter_phase1(), we add the necessary support for >>>>> strict-mode detection of syscalls. >>>>> [...] >>>>> >>>>> @@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct >>>>> pt_regs *regs, u32 arch) >>>>> */ >>>>> if (work & _TIF_NOHZ) { >>>>> enter_from_user_mode(); >>>>> + if (task_isolation_check_syscall(regs->orig_ax)) { >>>>> + regs->orig_ax = -1; >>>>> + return 0; >>>>> + } >>>> >>>> This needs a comment indicating the intended semantics. >>>> And I've still heard no explanation of why this part can't use seccomp. >>> >>> >>> Here's an excerpt from my earlier reply to you from: >>> >>> https://lkml.kernel.org/r/55AE9EAC.4010202@ezchip.com >>> >>> Admittedly this patch series has been moving very slowly through >>> review, so it's not surprising we have to revisit some things! >>> >>> On 07/21/2015 03:34 PM, Chris Metcalf wrote: >>>> >>>> On 07/13/2015 05:47 PM, Andy Lutomirski wrote: >>>>> >>>>> If a user wants a syscall to kill them, use >>>>> seccomp. The kernel isn't at fault if the user does a syscall when it >>>>> didn't want to enter the kernel. >>>> >>>> >>>> Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT >>>> was to what I wanted here. One concern is that there doesn't seem >>>> to be a way to "escape" from seccomp strict mode, i.e. you can't >>>> call seccomp() again to turn it off - which makes sense for seccomp >>>> since it's a security issue, but not so much sense with cpu_isolated. >>>> >>>> So, do you think there's a good role for the seccomp() API to play >>>> in achieving this goal? It's certainly not a question of "the kernel at >>>> fault" but rather "asking the kernel to help catch user mistakes" >>>> (typically third-party libraries in our customers' experience). You >>>> could imagine a SECCOMP_SET_MODE_ISOLATED or something. >>>> >>>> Alternatively, we could stick with the API proposed in my patch >>>> series, or something similar, and just try to piggy-back on the seccomp >>>> internals to make it happen. It would require Kconfig to ensure >>>> that SECCOMP was enabled though, which obviously isn't currently >>>> required to do cpu isolation. >>> >>> >>> On looking at this again just now, one thing that strikes me is that >>> it may not be necessary to forbid the syscall like seccomp does. >>> It may be sufficient just to trigger the task isolation strict signal >>> and then allow the syscall to complete. After all, we don't "fail" >>> any of the other things that upset strict mode, like page faults; we >>> let them complete, but add a signal. So for consistency, I think it >>> may in fact make sense to simply trigger the signal but let the >>> syscall do its thing. After all, perhaps the signal is handled >>> and logged and we don't mind having the application continue; the >>> signal handler can certainly choose to fail hard, or in the usual >>> case of no signal handler, that kills the task just fine too. >>> Allowing the syscall to complete is really kind of incidental. >> >> No, don't do that. First, if you have a signal pending, a lot of >> syscalls will abort with -EINTR. Second, if you fire a signal on >> entry via sigreturn, you're not going to like the results. > > > OK, you've convinced me to stick with the previous model of just > forbidding the syscall in this case. > >> Let task isolation users who want to detect when they screw up and do >> a syscall do it with seccomp. > > > Can you give me more details on what you're imagining here? Remember > that a key use case is that these applications can remove the syscall > prohibition voluntarily; it's only there to prevent unintended uses > (by third party libraries or just straight-up programming bugs). > As far as I can tell, seccomp does not allow you to go from "less > permissive" to "more permissive" settings at all, which means that as > it exists, it's not a good solution for this use case. > > Or were you thinking about a new seccomp API that allows this? I was. This is at least the second time I've wanted a way to ask seccomp to allow a layer to be removed.