MIME-Version: 1.0
In-Reply-To: <56D895EA.1060301@mellanox.com>
References: <1456949376-4910-1-git-send-email-cmetcalf@ezchip.com>
 <1456949376-4910-10-git-send-email-cmetcalf@ezchip.com> <CALCETrX6wJHC_yBGy7H6LamHTvGf8x+Fjqi6P2jxUBZ7GBp0AQ@mail.gmail.com>
 <56D895EA.1060301@mellanox.com>
From: Andy Lutomirski <luto@amacapital.net>
Date: Thu, 3 Mar 2016 15:46:30 -0800
Message-ID: <CALCETrUrc_LJyLJLHefSDYagCrNqqzKuknr6uLgVXnPW8PmZKw@mail.gmail.com>
Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality
To: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>, Christoph Lameter <cl@linux.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Viresh Kumar <viresh.kumar@linaro.org>, Ingo Molnar <mingo@kernel.org>,
        Steven Rostedt <rostedt@goodmis.org>, Tejun Heo <tj@kernel.org>,
        Gilad Ben Yossef <giladb@ezchip.com>,
        Will Deacon <will.deacon@arm.com>, Rik van Riel <riel@redhat.com>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        X86 ML <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
        Catalin Marinas <catalin.marinas@arm.com>,
        Peter Zijlstra <peterz@infradead.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6848
Lines: 169

On Thu, Mar 3, 2016 at 11:52 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 03/02/2016 07:36 PM, Andy Lutomirski wrote:
>>
>> On Mar 2, 2016 12:10 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote:
>>>
>>> In prepare_exit_to_usermode(), call task_isolation_ready()
>>> when we are checking the thread-info flags, and after we've handled
>>> the other work, call task_isolation_enter() unconditionally.
>>>
>>> In syscall_trace_enter_phase1(), we add the necessary support for
>>> strict-mode detection of syscalls.
>>>
>>> We add strict reporting for the kernel exception types that do
>>> not result in signals, namely non-signalling page faults and
>>> non-signalling MPX fixups.
>>>
>>> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
>>> ---
>>>   arch/x86/entry/common.c | 18 ++++++++++++++++--
>>>   arch/x86/kernel/traps.c |  2 ++
>>>   arch/x86/mm/fault.c     |  2 ++
>>>   3 files changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>>> index 03663740c866..27c71165416b 100644
>>> --- a/arch/x86/entry/common.c
>>> +++ b/arch/x86/entry/common.c
>>> @@ -21,6 +21,7 @@
>>>   #include <linux/context_tracking.h>
>>>   #include <linux/user-return-notifier.h>
>>>   #include <linux/uprobes.h>
>>> +#include <linux/isolation.h>
>>>
>>>   #include <asm/desc.h>
>>>   #include <asm/traps.h>
>>> @@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct
>>> pt_regs *regs, u32 arch)
>>>           */
>>>          if (work & _TIF_NOHZ) {
>>>                  enter_from_user_mode();
>>> +               if (task_isolation_check_syscall(regs->orig_ax)) {
>>> +                       regs->orig_ax = -1;
>>> +                       return 0;
>>> +               }
>>
>> This needs a comment indicating the intended semantics.
>> And I've still heard no explanation of why this part can't use seccomp.
>
>
> Here's an excerpt from my earlier reply to you from:
>
>   https://lkml.kernel.org/r/55AE9EAC.4010202@ezchip.com
>
> Admittedly this patch series has been moving very slowly through
> review, so it's not surprising we have to revisit some things!
>
> On 07/21/2015 03:34 PM, Chris Metcalf wrote:
>>
>> On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
>>>
>>> If a user wants a syscall to kill them, use
>>> seccomp.  The kernel isn't at fault if the user does a syscall when it
>>> didn't want to enter the kernel.
>>
>>
>> Interesting!  I didn't realize how close SECCOMP_SET_MODE_STRICT
>> was to what I wanted here.  One concern is that there doesn't seem
>> to be a way to "escape" from seccomp strict mode, i.e. you can't
>> call seccomp() again to turn it off - which makes sense for seccomp
>> since it's a security issue, but not so much sense with cpu_isolated.
>>
>> So, do you think there's a good role for the seccomp() API to play
>> in achieving this goal?  It's certainly not a question of "the kernel at
>> fault" but rather "asking the kernel to help catch user mistakes"
>> (typically third-party libraries in our customers' experience).  You
>> could imagine a SECCOMP_SET_MODE_ISOLATED or something.
>>
>> Alternatively, we could stick with the API proposed in my patch
>> series, or something similar, and just try to piggy-back on the seccomp
>> internals to make it happen.  It would require Kconfig to ensure
>> that SECCOMP was enabled though, which obviously isn't currently
>> required to do cpu isolation.
>
>
> On looking at this again just now, one thing that strikes me is that
> it may not be necessary to forbid the syscall like seccomp does.
> It may be sufficient just to trigger the task isolation strict signal
> and then allow the syscall to complete.  After all, we don't "fail"
> any of the other things that upset strict mode, like page faults; we
> let them complete, but add a signal.  So for consistency, I think it
> may in fact make sense to simply trigger the signal but let the
> syscall do its thing.  After all, perhaps the signal is handled
> and logged and we don't mind having the application continue; the
> signal handler can certainly choose to fail hard, or in the usual
> case of no signal handler, that kills the task just fine too.
> Allowing the syscall to complete is really kind of incidental.

No, don't do that.  First, if you have a signal pending, a lot of
syscalls will abort with -EINTR.  Second, if you fire a signal on
entry via sigreturn, you're not going to like the results.


>
> After the changes proposed lower down in this email, this call
> site will end up looking like:
>
>         /* Generate a task isolation strict signal if necessary. */
>         if ((work & _TIF_TASK_ISOLATION) && task_isolation_strict())
>                 task_isolation_syscall(regs->orig_ax);
>
> But do let me know if you think more specifically that there is
> a way seccomp can be helpful.

Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.

>
>
>>>                  work &= ~_TIF_NOHZ;
>>>          }
>>>   #endif
>>> @@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs
>>> *regs, u32 cached_flags)
>>>                  if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>>>                          fire_user_return_notifiers();
>>>
>>> +               task_isolation_enter();
>>> +
>>>                  /* Disable IRQs and retry */
>>>                  local_irq_disable();
>>>
>>>                  cached_flags =
>>> READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>>>
>>> -               if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>>> +               if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
>>> +                   task_isolation_ready())
>>>                          break;
>>>
>>>          }
>>>   }
>>>
>>> +#ifdef CONFIG_TASK_ISOLATION
>>> +# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS |
>>> _TIF_NOHZ)
>>> +#else
>>> +# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS
>>> +#endif
>>> +
>>
>> TIF_NOHZ is already a hack and IMO this just makes it worse.  At the
>> very least this should have a comment.  It really ought to be either a
>> static_branch or a flag that's actually switched per cpu.
>> But this is also a confusing change.  Don't override the definition
>> here -- stick it in the header where it belongs.
>
>
> The EXIT_TO_USERMODE_xxx stuff is only defined in common.c, not in a header.
>
> The more I look at this, though, the more I think it would be cleanest
> to add a new flag _TIF_TASK_ISOLATION that is set just for tasks that have
> enabled task isolation.  That can be used unconditionally to check to see
> if we need to call exit_to_usermode_loop() from prepare_exit_to_usermode(),
> and also means that non-task-isolation tasks don't need to go into the
> exit loop unconditionally, which is obviously a performance drag for them.

Sounds better to me.

--Andy