Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755931AbcCCXqx (ORCPT ); Thu, 3 Mar 2016 18:46:53 -0500 Received: from mail-oi0-f54.google.com ([209.85.218.54]:33663 "EHLO mail-oi0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751010AbcCCXqv (ORCPT ); Thu, 3 Mar 2016 18:46:51 -0500 MIME-Version: 1.0 In-Reply-To: <56D895EA.1060301@mellanox.com> References: <1456949376-4910-1-git-send-email-cmetcalf@ezchip.com> <1456949376-4910-10-git-send-email-cmetcalf@ezchip.com> <56D895EA.1060301@mellanox.com> From: Andy Lutomirski Date: Thu, 3 Mar 2016 15:46:30 -0800 Message-ID: Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality To: Chris Metcalf Cc: Thomas Gleixner , Christoph Lameter , Andrew Morton , Viresh Kumar , Ingo Molnar , Steven Rostedt , Tejun Heo , Gilad Ben Yossef , Will Deacon , Rik van Riel , Frederic Weisbecker , "Paul E. McKenney" , "linux-kernel@vger.kernel.org" , X86 ML , "H. Peter Anvin" , Catalin Marinas , Peter Zijlstra Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6848 Lines: 169 On Thu, Mar 3, 2016 at 11:52 AM, Chris Metcalf wrote: > On 03/02/2016 07:36 PM, Andy Lutomirski wrote: >> >> On Mar 2, 2016 12:10 PM, "Chris Metcalf" wrote: >>> >>> In prepare_exit_to_usermode(), call task_isolation_ready() >>> when we are checking the thread-info flags, and after we've handled >>> the other work, call task_isolation_enter() unconditionally. >>> >>> In syscall_trace_enter_phase1(), we add the necessary support for >>> strict-mode detection of syscalls. >>> >>> We add strict reporting for the kernel exception types that do >>> not result in signals, namely non-signalling page faults and >>> non-signalling MPX fixups. >>> >>> Signed-off-by: Chris Metcalf >>> --- >>> arch/x86/entry/common.c | 18 ++++++++++++++++-- >>> arch/x86/kernel/traps.c | 2 ++ >>> arch/x86/mm/fault.c | 2 ++ >>> 3 files changed, 20 insertions(+), 2 deletions(-) >>> >>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c >>> index 03663740c866..27c71165416b 100644 >>> --- a/arch/x86/entry/common.c >>> +++ b/arch/x86/entry/common.c >>> @@ -21,6 +21,7 @@ >>> #include >>> #include >>> #include >>> +#include >>> >>> #include >>> #include >>> @@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct >>> pt_regs *regs, u32 arch) >>> */ >>> if (work & _TIF_NOHZ) { >>> enter_from_user_mode(); >>> + if (task_isolation_check_syscall(regs->orig_ax)) { >>> + regs->orig_ax = -1; >>> + return 0; >>> + } >> >> This needs a comment indicating the intended semantics. >> And I've still heard no explanation of why this part can't use seccomp. > > > Here's an excerpt from my earlier reply to you from: > > https://lkml.kernel.org/r/55AE9EAC.4010202@ezchip.com > > Admittedly this patch series has been moving very slowly through > review, so it's not surprising we have to revisit some things! > > On 07/21/2015 03:34 PM, Chris Metcalf wrote: >> >> On 07/13/2015 05:47 PM, Andy Lutomirski wrote: >>> >>> If a user wants a syscall to kill them, use >>> seccomp. The kernel isn't at fault if the user does a syscall when it >>> didn't want to enter the kernel. >> >> >> Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT >> was to what I wanted here. One concern is that there doesn't seem >> to be a way to "escape" from seccomp strict mode, i.e. you can't >> call seccomp() again to turn it off - which makes sense for seccomp >> since it's a security issue, but not so much sense with cpu_isolated. >> >> So, do you think there's a good role for the seccomp() API to play >> in achieving this goal? It's certainly not a question of "the kernel at >> fault" but rather "asking the kernel to help catch user mistakes" >> (typically third-party libraries in our customers' experience). You >> could imagine a SECCOMP_SET_MODE_ISOLATED or something. >> >> Alternatively, we could stick with the API proposed in my patch >> series, or something similar, and just try to piggy-back on the seccomp >> internals to make it happen. It would require Kconfig to ensure >> that SECCOMP was enabled though, which obviously isn't currently >> required to do cpu isolation. > > > On looking at this again just now, one thing that strikes me is that > it may not be necessary to forbid the syscall like seccomp does. > It may be sufficient just to trigger the task isolation strict signal > and then allow the syscall to complete. After all, we don't "fail" > any of the other things that upset strict mode, like page faults; we > let them complete, but add a signal. So for consistency, I think it > may in fact make sense to simply trigger the signal but let the > syscall do its thing. After all, perhaps the signal is handled > and logged and we don't mind having the application continue; the > signal handler can certainly choose to fail hard, or in the usual > case of no signal handler, that kills the task just fine too. > Allowing the syscall to complete is really kind of incidental. No, don't do that. First, if you have a signal pending, a lot of syscalls will abort with -EINTR. Second, if you fire a signal on entry via sigreturn, you're not going to like the results. > > After the changes proposed lower down in this email, this call > site will end up looking like: > > /* Generate a task isolation strict signal if necessary. */ > if ((work & _TIF_TASK_ISOLATION) && task_isolation_strict()) > task_isolation_syscall(regs->orig_ax); > > But do let me know if you think more specifically that there is > a way seccomp can be helpful. Let task isolation users who want to detect when they screw up and do a syscall do it with seccomp. > > >>> work &= ~_TIF_NOHZ; >>> } >>> #endif >>> @@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs >>> *regs, u32 cached_flags) >>> if (cached_flags & _TIF_USER_RETURN_NOTIFY) >>> fire_user_return_notifiers(); >>> >>> + task_isolation_enter(); >>> + >>> /* Disable IRQs and retry */ >>> local_irq_disable(); >>> >>> cached_flags = >>> READ_ONCE(pt_regs_to_thread_info(regs)->flags); >>> >>> - if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS)) >>> + if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) && >>> + task_isolation_ready()) >>> break; >>> >>> } >>> } >>> >>> +#ifdef CONFIG_TASK_ISOLATION >>> +# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | >>> _TIF_NOHZ) >>> +#else >>> +# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS >>> +#endif >>> + >> >> TIF_NOHZ is already a hack and IMO this just makes it worse. At the >> very least this should have a comment. It really ought to be either a >> static_branch or a flag that's actually switched per cpu. >> But this is also a confusing change. Don't override the definition >> here -- stick it in the header where it belongs. > > > The EXIT_TO_USERMODE_xxx stuff is only defined in common.c, not in a header. > > The more I look at this, though, the more I think it would be cleanest > to add a new flag _TIF_TASK_ISOLATION that is set just for tasks that have > enabled task isolation. That can be used unconditionally to check to see > if we need to call exit_to_usermode_loop() from prepare_exit_to_usermode(), > and also means that non-task-isolation tasks don't need to go into the > exit loop unconditionally, which is obviously a performance drag for them. Sounds better to me. --Andy