MIME-Version: 1.0
In-Reply-To: <56DDE9C9.5060900@mellanox.com>
References: <1456949376-4910-1-git-send-email-cmetcalf@ezchip.com>
 <1456949376-4910-10-git-send-email-cmetcalf@ezchip.com> <CALCETrX6wJHC_yBGy7H6LamHTvGf8x+Fjqi6P2jxUBZ7GBp0AQ@mail.gmail.com>
 <56D895EA.1060301@mellanox.com> <CALCETrUrc_LJyLJLHefSDYagCrNqqzKuknr6uLgVXnPW8PmZKw@mail.gmail.com>
 <56DDE9C9.5060900@mellanox.com>
From: Andy Lutomirski <luto@amacapital.net>
Date: Mon, 7 Mar 2016 12:55:24 -0800
Message-ID: <CALCETrUrP+gZsDLChMi5ZbT-TkD4gXvMZQt+iun2EYHipcuxHQ@mail.gmail.com>
Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality
To: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>, Christoph Lameter <cl@linux.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Viresh Kumar <viresh.kumar@linaro.org>, Ingo Molnar <mingo@kernel.org>,
        Steven Rostedt <rostedt@goodmis.org>, Tejun Heo <tj@kernel.org>,
        Gilad Ben Yossef <giladb@ezchip.com>,
        Will Deacon <will.deacon@arm.com>, Rik van Riel <riel@redhat.com>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        X86 ML <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
        Catalin Marinas <catalin.marinas@arm.com>,
        Peter Zijlstra <peterz@infradead.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org

On Mon, Mar 7, 2016 at 12:51 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 03/03/2016 06:46 PM, Andy Lutomirski wrote:
>>
>> On Thu, Mar 3, 2016 at 11:52 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>
>>> On 03/02/2016 07:36 PM, Andy Lutomirski wrote:
>>>>
>>>> On Mar 2, 2016 12:10 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote:
>>>>>
>>>>> In prepare_exit_to_usermode(), call task_isolation_ready()
>>>>> when we are checking the thread-info flags, and after we've handled
>>>>> the other work, call task_isolation_enter() unconditionally.
>>>>>
>>>>> In syscall_trace_enter_phase1(), we add the necessary support for
>>>>> strict-mode detection of syscalls.
>>>>> [...]
>>>>>
>>>>> @@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>>>>            */
>>>>>           if (work & _TIF_NOHZ) {
>>>>>                   enter_from_user_mode();
>>>>> +               if (task_isolation_check_syscall(regs->orig_ax)) {
>>>>> +                       regs->orig_ax = -1;
>>>>> +                       return 0;
>>>>> +               }
>>>>
>>>> This needs a comment indicating the intended semantics.
>>>> And I've still heard no explanation of why this part can't use seccomp.
>>>
>>>
>>> Here's an excerpt from my earlier reply to you from:
>>>
>>>    https://lkml.kernel.org/r/55AE9EAC.4010202@ezchip.com
>>>
>>> Admittedly this patch series has been moving very slowly through
>>> review, so it's not surprising we have to revisit some things!
>>>
>>> On 07/21/2015 03:34 PM, Chris Metcalf wrote:
>>>>
>>>> On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
>>>>>
>>>>> If a user wants a syscall to kill them, use
>>>>> seccomp.  The kernel isn't at fault if the user does a syscall when it
>>>>> didn't want to enter the kernel.
>>>>
>>>>
>>>> Interesting!  I didn't realize how close SECCOMP_SET_MODE_STRICT
>>>> was to what I wanted here.  One concern is that there doesn't seem
>>>> to be a way to "escape" from seccomp strict mode, i.e. you can't
>>>> call seccomp() again to turn it off - which makes sense for seccomp
>>>> since it's a security issue, but not so much sense with cpu_isolated.
>>>>
>>>> So, do you think there's a good role for the seccomp() API to play
>>>> in achieving this goal?  It's certainly not a question of "the kernel at
>>>> fault" but rather "asking the kernel to help catch user mistakes"
>>>> (typically third-party libraries in our customers' experience).  You
>>>> could imagine a SECCOMP_SET_MODE_ISOLATED or something.
>>>>
>>>> Alternatively, we could stick with the API proposed in my patch
>>>> series, or something similar, and just try to piggy-back on the seccomp
>>>> internals to make it happen.  It would require Kconfig to ensure
>>>> that SECCOMP was enabled though, which obviously isn't currently
>>>> required to do cpu isolation.
>>>
>>>
>>> On looking at this again just now, one thing that strikes me is that
>>> it may not be necessary to forbid the syscall like seccomp does.
>>> It may be sufficient just to trigger the task isolation strict signal
>>> and then allow the syscall to complete.  After all, we don't "fail"
>>> any of the other things that upset strict mode, like page faults; we
>>> let them complete, but add a signal.  So for consistency, I think it
>>> may in fact make sense to simply trigger the signal but let the
>>> syscall do its thing.  After all, perhaps the signal is handled
>>> and logged and we don't mind having the application continue; the
>>> signal handler can certainly choose to fail hard, or in the usual
>>> case of no signal handler, that kills the task just fine too.
>>> Allowing the syscall to complete is really kind of incidental.
>>
>> No, don't do that.  First, if you have a signal pending, a lot of
>> syscalls will abort with -EINTR.  Second, if you fire a signal on
>> entry via sigreturn, you're not going to like the results.
>
>
> OK, you've convinced me to stick with the previous model of just
> forbidding the syscall in this case.
>
>> Let task isolation users who want to detect when they screw up and do
>> a syscall do it with seccomp.
>
>
> Can you give me more details on what you're imagining here?  Remember
> that a key use case is that these applications can remove the syscall
> prohibition voluntarily; it's only there to prevent unintended uses
> (by third-party libraries or just straight-up programming bugs).
> As far as I can tell, seccomp does not allow you to go from "less
> permissive" to "more permissive" settings at all, which means that as
> it exists, it's not a good solution for this use case.
>
> Or were you thinking about a new seccomp API that allows this?

I was.  This is at least the second time I've wanted a way to ask
seccomp to allow a layer to be removed.
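
To make "allow a layer to be removed" concrete, here is a rough
userspace sketch of the semantics under discussion.  It is a toy model,
not kernel code and not an existing seccomp interface: the names
filter_stack, stack_push(), stack_pop(), isolation_layer() and the
syscall number NR_LEAVE_ISOLATION are all made up for illustration.
The only part that mirrors current seccomp behavior is that every
installed layer is consulted and the most restrictive verdict wins;
stack_pop() is the hypothetical piece that seccomp does not offer today.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names exist in the kernel. */
enum verdict { VERDICT_KILL = 0, VERDICT_ALLOW = 1 };

typedef enum verdict (*filter_fn)(long nr);

#define MAX_LAYERS 8

struct filter_stack {
	filter_fn layers[MAX_LAYERS];
	size_t depth;
};

/* Install a new layer on top; returns false if the stack is full. */
static bool stack_push(struct filter_stack *s, filter_fn f)
{
	if (s->depth >= MAX_LAYERS)
		return false;
	s->layers[s->depth++] = f;
	return true;
}

/*
 * Hypothetical operation: drop the most recently installed layer.
 * This is the step that goes from "less permissive" to "more
 * permissive", which seccomp as it exists does not permit.
 */
static bool stack_pop(struct filter_stack *s)
{
	if (s->depth == 0)
		return false;
	s->depth--;
	return true;
}

/* Consult every layer; the most restrictive verdict wins. */
static enum verdict stack_check(const struct filter_stack *s, long nr)
{
	enum verdict result = VERDICT_ALLOW;

	for (size_t i = 0; i < s->depth; i++) {
		enum verdict v = s->layers[i](nr);

		if (v < result)
			result = v;
	}
	return result;
}

/* Made-up syscall number standing in for "leave isolation mode". */
#define NR_LEAVE_ISOLATION 1000L

/* Example layer: while isolated, only the "leave" call is allowed. */
static enum verdict isolation_layer(long nr)
{
	return nr == NR_LEAVE_ISOLATION ? VERDICT_ALLOW : VERDICT_KILL;
}
```

In this model a task would push isolation_layer when it enters
isolation, and the handler for the one permitted call would pop it
again, so stray syscalls from a library are caught in between.  How
removal could be made safe for real seccomp users who rely on filters
being permanent is exactly the open question.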