Date: Thu, 11 Jan 2018 16:44:12 +0100
From: Willy Tarreau
To: Dave Hansen
Cc: Linus Torvalds, Andy Lutomirski, Peter Zijlstra, LKML, X86 ML,
    Borislav Petkov, Brian Gerst, Ingo Molnar, Thomas Gleixner,
    Josh Poimboeuf, "H. Peter Anvin", Greg Kroah-Hartman, Kees Cook
Subject: Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set
Message-ID: <20180111154412.GA15296@1wt.eu>
References: <1515502580-12261-1-git-send-email-w@1wt.eu>
    <1515502580-12261-7-git-send-email-w@1wt.eu>
    <20180110082207.GX29822@worktop.programming.kicks-ass.net>
    <20180110091102.GH14066@1wt.eu>
    <20180111064259.GC14920@1wt.eu>
    <0f08d89e-61e1-20e3-5c59-0b2f7b32bf0c@linux.intel.com>
In-Reply-To: <0f08d89e-61e1-20e3-5c59-0b2f7b32bf0c@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Dave,

On Thu, Jan 11, 2018 at 07:29:30AM -0800, Dave Hansen wrote:
> I don't think we need a "NOW" and "NEXT" mode, at least initially.  The
> "NEXT" semantics are going to be tricky and I think "NOW" is good enough

In fact I thought the NEXT one would bring us a nice benefit which is
that we start the new process knowing the flag's value so we can decide
whether or not to apply _PAGE_NX on the pgd from the start, and never
touch it anymore.

> Whatever we do, we'll need this PTI-disable flag to be able cross
> exeve() so that a wrapper a la nice(1) work.

Absolutely!

> Initially, I think the
> default should be that it survives fork().
> There are just too many
> things out there that "start up" by doing a shell script that calls a
> python script, that calls a...

Not only that, but also plain daemons, which is what most services are!

> Without the wrapper support, we're _basically_ stuck using this only in
> newly-compiled binaries.  That's going to make it much less likely to
> get used.

I know, that's why I kept considering that option despite not really
needing it for my own use case.

> The inheritance also gives an app a way to re-enable protections for
> children, just from a _second_ wrapper.  That's nice because it means we
> don't initially need a "NEXT" ABI.
>
> So, I'd do this:
> 1. Do the arch_prctl() (but ask the ARM guys what they want too)
> 2. Enabled for an entire process (not thread)
> 3. Inherited across fork/exec
> 4. Cleared on setuid() and friends

This one causes me a problem: some daemons already take care of dropping
privileges after the initial fork() for the sake of security. Haproxy
typically does this at boot:

  - parse config
  - chroot to /var/empty
  - setuid(dedicated_uid)
  - fork()

This ensures the process is properly isolated and hard enough to break
out of. So I'd really like this setuid() not to annihilate all we've
done. We probably want to drop it on suid binaries, however, though I'm
having doubts about the benefits, because if the binary already allows
an intruder to inject its own meltdown code, you're quite screwed anyway.

> 5. I'm sure the security folks have/want a way to force it on forever

Sure! That's what I implemented using the sysctl.

> Next, if we decide that we have things that both don't want PTI's
> protections and are forking things not covered by #4, we can add some
> "child opt out" in the prctl(), plus maybe marking binaries somehow.

I was really thinking about using the "NOW" for this compared to the
NEXT. But I don't know what it could imply for the pgd not having the
_PAGE_NX.
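
To make the objection to item 4 concrete, here is a toy userspace model
of the rules discussed above. It is only a sketch: it is not kernel code,
and every name in it (struct task_model, model_disable_pti() and so on)
is made up for illustration. It assumes exec keeps the flag unchanged, so
exec is not modelled as a separate step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
struct task_model {
    bool pti_disabled;  /* hypothetical per-process "PTI off" flag */
    unsigned int uid;
};

/* Items 1 and 5: a prctl-like request, refused when the admin has
 * forced PTI on (modelled here as a boolean sysctl). */
static bool model_disable_pti(struct task_model *t, bool sysctl_force_pti)
{
    assert(t != NULL);
    if (sysctl_force_pti)
        return false;
    t->pti_disabled = true;
    return true;
}

/* Item 3: the child starts with a copy of the parent's flag. */
static struct task_model model_fork(const struct task_model *parent)
{
    assert(parent != NULL);
    return *parent;
}

/* Item 4: changing uid optionally clears the flag. */
static void model_setuid(struct task_model *t, unsigned int uid,
                         bool clear_on_setuid)
{
    assert(t != NULL);
    t->uid = uid;
    if (clear_on_setuid)
        t->pti_disabled = false;
}

/* Daemon startup in the order listed above: the flag is set first
 * (e.g. by a wrapper), then setuid(), then fork(). Returns the flag
 * value seen by the forked worker. */
static bool model_daemon_worker_flag(bool clear_on_setuid)
{
    struct task_model daemon = { .pti_disabled = false, .uid = 0 };
    struct task_model worker;

    model_disable_pti(&daemon, false);
    model_setuid(&daemon, 1000, clear_on_setuid);
    worker = model_fork(&daemon);
    return worker.pti_disabled;
}
```

In this model, model_daemon_worker_flag(true) yields false: with
"cleared on setuid()", the worker ends up with PTI enabled again even
though the wrapper asked for it to be off. With the clearing disabled,
the worker keeps the flag.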

> Please don't forget to add ways to tell if this feature is on/off in
> /proc or whatever.

Very good idea, and it will be much more convenient than using the GET
prctl that I didn't like.

> I think we also need to be able to dump the actual
> CR3 value that we entered the kernel with before we start doing too much
> other funky stuff with the entry code.

When you say dump, you mean save it somewhere in a per_cpu variable?

Willy