Date: Thu, 11 Jan 2018 18:40:25 +0100
From: Willy Tarreau <w@1wt.eu>
To: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Peter Zijlstra <peterz@infradead.org>,
        LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>,
        Borislav Petkov <bp@alien8.de>,
        Brian Gerst <brgerst@gmail.com>,
        Ingo Molnar <mingo@kernel.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Josh Poimboeuf <jpoimboe@redhat.com>,
        "H. Peter Anvin" <hpa@zytor.com>,
        Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
        Kees Cook <keescook@chromium.org>
Subject: Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when
 pti_disable is set
Message-ID: <20180111174025.GB15344@1wt.eu>
References: <1515502580-12261-1-git-send-email-w@1wt.eu>
 <1515502580-12261-7-git-send-email-w@1wt.eu>
 <20180110082207.GX29822@worktop.programming.kicks-ass.net>
 <20180110091102.GH14066@1wt.eu>
 <CALCETrXQFL2sJPJe_Z8XLEo11V_tLyZR0y4PTRxJzrhmpdsuJg@mail.gmail.com>
 <CA+55aFwfSKNAYEFWCwCe2vSxejA3X85FnX9t1EdnCRUwC1ou1Q@mail.gmail.com>
 <20180111064259.GC14920@1wt.eu>
 <0f08d89e-61e1-20e3-5c59-0b2f7b32bf0c@linux.intel.com>
 <20180111154412.GA15296@1wt.eu>
 <CALCETrVcQg_1opnvOP4ksOAC07K4O_LTSxy2czwtObwR3YL+-w@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CALCETrVcQg_1opnvOP4ksOAC07K4O_LTSxy2czwtObwR3YL+-w@mail.gmail.com>
User-Agent: Mutt/1.6.1 (2016-04-27)
Sender: linux-kernel-owner@vger.kernel.org

Hi Andy!

On Thu, Jan 11, 2018 at 09:09:14AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 11, 2018 at 7:44 AM, Willy Tarreau <w@1wt.eu> wrote:
> >> So, I'd do this:
> >> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
> >> 2. Enabled for an entire process (not thread)
> >> 3. Inherited across fork/exec
> >> 4. Cleared on setuid() and friends
> >
> > This one causes me a problem : some daemons already take care of dropping
> > privileges after the initial fork() for the sake of security. Haproxy
> > typically does this at boot :
> >
> >    - parse config
> >    - chroot to /var/empty
> >    - setuid(dedicated_uid)
> >    - fork()
> >
> > This ensures the process is properly isolated and hard enough to break out
> > of. So I'd really like this setuid() not to anihilate all we've done.
> > Probably that we want to drop it on suid binaries however, though I'm
> > having doubts about the benefits, because if the binary already allows
> > an intruder to inject its own meltdown code, you're quite screwed anyway.
> >
> >> 5. I'm sure the security folks have/want a way to force it on forever
> >
> > Sure! That's what I implemented using the sysctl.
> >
> 
> All of these proposals have serious issues.  For example, suppose I
> have a setuid program called nopti that works like this:
> 
> $ nopti some_program
> 
> nopti verifies that some_program is trustworthy and runs it (as the
> real uid of nopti's user) with PTI off.  Now we have all the usual
> problems: you can easily break out using ptrace(), for example.  And
> LD_PRELOAD gets this wrong.  Et.

Fine, we can drop it on suid binaries as I mentionned.

> So I think that no-pti mode is a privilege as opposed to a mode per
> se.  If you can turn off PTI, then you have the ability to read all of
> kernel memory  So maybe we should treat it as such.  Add a capability
> CAP_DISABLE_PTI.  If you have that capability (globally), then you can
> use the arch_prctl() or regular prctl() or whatever to turn PTI on.
> If you lose the cap, you lose no-pti mode as well.

I disagree on this, because the only alternative I have is to decide
to keep my process running as root, which is even worse, as root can
much more easily escape from a chroot jail for example, or access
/dev/mem and read all the memory as well. Also tell Linus that he'll
have to build his kernels as root ;-)

The arch_prctl() calls I proposed only allow to turn PTI off for
privileged users but any user can turn it back on. For me it's
important. A process might trust itself for its own use, but such
processes will rarely trust external processes in case they need to
perform an occasional fork+exec. Imagine for example a static web
server requiring to achieve the highest possible performance and
having to occasionally call logrotate to rotate+compress the logs.
It's better if the process knows how to turn PTI back on before
calling this.

> If an LSM wants to block it, it can use existing mechanisms.
> 
> As for per-mm vs per-thread, let's make it only switchable in
> single-threaded processes for now and inherited when threads are
> created.

That's exactly what it does for now, but Linus doesn't like it at all.
So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
point regarding the pgd and _PAGE_NX setting. point Now at least we know
the change is minimal if we have a good reason for doing differently
later.

> We can change that if and when demand for the ability to
> change it shows up.
> 
> (Another reason for per-thread instead of per-mm: as a per-mm thing,
> you can't set it up for your descendents using vfork(); prctl();
> exec(), and the latter is how your average language runtime that
> spawns subprocesses would want to do it.

That's indeed the benefit it provides for now since I actually had
to *add* code to execve() to disable it then.

Cheers,
Willy