DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9B8732175D
MIME-Version: 1.0
In-Reply-To: <20180111174025.GB15344@1wt.eu>
References: <1515502580-12261-1-git-send-email-w@1wt.eu> <1515502580-12261-7-git-send-email-w@1wt.eu>
 <20180110082207.GX29822@worktop.programming.kicks-ass.net>
 <20180110091102.GH14066@1wt.eu> <CALCETrXQFL2sJPJe_Z8XLEo11V_tLyZR0y4PTRxJzrhmpdsuJg@mail.gmail.com>
 <CA+55aFwfSKNAYEFWCwCe2vSxejA3X85FnX9t1EdnCRUwC1ou1Q@mail.gmail.com>
 <20180111064259.GC14920@1wt.eu> <0f08d89e-61e1-20e3-5c59-0b2f7b32bf0c@linux.intel.com>
 <20180111154412.GA15296@1wt.eu> <CALCETrVcQg_1opnvOP4ksOAC07K4O_LTSxy2czwtObwR3YL+-w@mail.gmail.com>
 <20180111174025.GB15344@1wt.eu>
From: Andy Lutomirski <luto@kernel.org>
Date: Thu, 11 Jan 2018 09:53:26 -0800
Message-ID: <CALCETrWb7esaT03RPyRqOzUPkYZ3Hxz1M_WwwkGtStwfYn3ehg@mail.gmail.com>
Subject: Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when
 pti_disable is set
To: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Peter Zijlstra <peterz@infradead.org>,
        LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>,
        Borislav Petkov <bp@alien8.de>,
        Brian Gerst <brgerst@gmail.com>,
        Ingo Molnar <mingo@kernel.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Josh Poimboeuf <jpoimboe@redhat.com>,
        "H. Peter Anvin" <hpa@zytor.com>,
        Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
        Kees Cook <keescook@chromium.org>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org

On Thu, Jan 11, 2018 at 9:40 AM, Willy Tarreau <w@1wt.eu> wrote:
> Hi Andy!
>
> On Thu, Jan 11, 2018 at 09:09:14AM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 11, 2018 at 7:44 AM, Willy Tarreau <w@1wt.eu> wrote:
>> >> So, I'd do this:
>> >> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
>> >> 2. Enabled for an entire process (not thread)
>> >> 3. Inherited across fork/exec
>> >> 4. Cleared on setuid() and friends
>> >
>> > This one causes me a problem : some daemons already take care of dropping
>> > privileges after the initial fork() for the sake of security. Haproxy
>> > typically does this at boot :
>> >
>> >    - parse config
>> >    - chroot to /var/empty
>> >    - setuid(dedicated_uid)
>> >    - fork()
>> >
>> > This ensures the process is properly isolated and hard enough to break out
>> > of. So I'd really like this setuid() not to anihilate all we've done.
>> > Probably that we want to drop it on suid binaries however, though I'm
>> > having doubts about the benefits, because if the binary already allows
>> > an intruder to inject its own meltdown code, you're quite screwed anyway.
>> >
>> >> 5. I'm sure the security folks have/want a way to force it on forever
>> >
>> > Sure! That's what I implemented using the sysctl.
>> >
>>
>> All of these proposals have serious issues.  For example, suppose I
>> have a setuid program called nopti that works like this:
>>
>> $ nopti some_program
>>
>> nopti verifies that some_program is trustworthy and runs it (as the
>> real uid of nopti's user) with PTI off.  Now we have all the usual
>> problems: you can easily break out using ptrace(), for example.  And
>> LD_PRELOAD gets this wrong.  Et.
>
> Fine, we can drop it on suid binaries as I mentionned.
>
>> So I think that no-pti mode is a privilege as opposed to a mode per
>> se.  If you can turn off PTI, then you have the ability to read all of
>> kernel memory  So maybe we should treat it as such.  Add a capability
>> CAP_DISABLE_PTI.  If you have that capability (globally), then you can
>> use the arch_prctl() or regular prctl() or whatever to turn PTI on.
>> If you lose the cap, you lose no-pti mode as well.
>
> I disagree on this, because the only alternative I have is to decide
> to keep my process running as root, which is even worse, as root can
> much more easily escape from a chroot jail for example, or access
> /dev/mem and read all the memory as well. Also tell Linus that he'll
> have to build his kernels as root ;-)

Not since Linux 4.3 :)  You can set CAP_DISABLE_PTI as an "ambient"
capability and let it be inherited by all descendents even if
unprivileged.  This was all very carefully designed so that a program
that inherited an ambient capability but tries to run a
less-privileged child using pre-4.3 techniques will correctly drop the
ambient capability, which is *exactly* what we want for PTI.

So I stand by my suggestion.  Linus could still do:

$ nopti make -j512

and have it work, but trying to ptrace() the make process from outside
the nopti process tree (and without doing nopti ptrace) will fail, as
expected.  (Unless root does the ptrace, again as expected.)

>
> The arch_prctl() calls I proposed only allow to turn PTI off for
> privileged users but any user can turn it back on. For me it's
> important. A process might trust itself for its own use, but such
> processes will rarely trust external processes in case they need to
> perform an occasional fork+exec. Imagine for example a static web
> server requiring to achieve the highest possible performance and
> having to occasionally call logrotate to rotate+compress the logs.
> It's better if the process knows how to turn PTI back on before
> calling this.

In my proposal, CAP_DISABLE_PTI doesn't turn off PTI -- it just grants
the ability to have PTI off.  If you have PTI off, you can turn it
back in using prctl() or whatever.  So you call prctl() (to turn PTI
back on) or capset() (to turn it on and drop the ability to turn it
off).

How exactly do you plan to efficiently call logrotate from your
webserver if the flag is per-mm, though?  You don't want to fork() in
your fancy server because fork() write-protects the whole address
space.  So you use clone() or vfork() (like posix_spawn() does
internally on a sane libc), but now you can't turn PTI back on if it's
per-mm because you haven't changed your mm.

I really really think it should be per thread.

>
>> If an LSM wants to block it, it can use existing mechanisms.
>>
>> As for per-mm vs per-thread, let's make it only switchable in
>> single-threaded processes for now and inherited when threads are
>> created.
>
> That's exactly what it does for now, but Linus doesn't like it at all.
> So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
> point regarding the pgd and _PAGE_NX setting. point Now at least we know
> the change is minimal if we have a good reason for doing differently
> later.

Yuck, I hate this.  Are you really going to wire it up complete with
all the IPIs needed to get the thing synced up right?  it's also going
to run slower no matter what, I think, because you'll have to sync the
per-mm state back to the TI flags on context switches.

Linus, can you explain again why you think PTI should be a per-mm
thing?  That is, why do you think it's useful and why do you think it
makes logical sense from a user's POV?  I think the implementation is
easier and saner for per-thread.  Also, if I we use a capability bit
for it, making it per-mm gets really weird.

--Andy