Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935480AbeAKRxt (ORCPT + 1 other); Thu, 11 Jan 2018 12:53:49 -0500 Received: from mail.kernel.org ([198.145.29.99]:48422 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934973AbeAKRxs (ORCPT ); Thu, 11 Jan 2018 12:53:48 -0500 DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9B8732175D Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=luto@kernel.org X-Google-Smtp-Source: ACJfBosDpNOI3SJz+ozlVUt/ZzKbLnVohLfEnvzH7dOQfurHvvvtRpD9rlmUnssRPmxqIGVus7S5WIfmtRekMkEwy5I= MIME-Version: 1.0 In-Reply-To: <20180111174025.GB15344@1wt.eu> References: <1515502580-12261-1-git-send-email-w@1wt.eu> <1515502580-12261-7-git-send-email-w@1wt.eu> <20180110082207.GX29822@worktop.programming.kicks-ass.net> <20180110091102.GH14066@1wt.eu> <20180111064259.GC14920@1wt.eu> <0f08d89e-61e1-20e3-5c59-0b2f7b32bf0c@linux.intel.com> <20180111154412.GA15296@1wt.eu> <20180111174025.GB15344@1wt.eu> From: Andy Lutomirski Date: Thu, 11 Jan 2018 09:53:26 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set To: Willy Tarreau Cc: Andy Lutomirski , Dave Hansen , Linus Torvalds , Peter Zijlstra , LKML , X86 ML , Borislav Petkov , Brian Gerst , Ingo Molnar , Thomas Gleixner , Josh Poimboeuf , "H. Peter Anvin" , Greg Kroah-Hartman , Kees Cook Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Return-Path: On Thu, Jan 11, 2018 at 9:40 AM, Willy Tarreau wrote: > Hi Andy! > > On Thu, Jan 11, 2018 at 09:09:14AM -0800, Andy Lutomirski wrote: >> On Thu, Jan 11, 2018 at 7:44 AM, Willy Tarreau wrote: >> >> So, I'd do this: >> >> 1. Do the arch_prctl() (but ask the ARM guys what they want too) >> >> 2. Enabled for an entire process (not thread) >> >> 3. Inherited across fork/exec >> >> 4. Cleared on setuid() and friends >> > >> > This one causes me a problem : some daemons already take care of dropping >> > privileges after the initial fork() for the sake of security. Haproxy >> > typically does this at boot : >> > >> > - parse config >> > - chroot to /var/empty >> > - setuid(dedicated_uid) >> > - fork() >> > >> > This ensures the process is properly isolated and hard enough to break out >> > of. So I'd really like this setuid() not to anihilate all we've done. >> > Probably that we want to drop it on suid binaries however, though I'm >> > having doubts about the benefits, because if the binary already allows >> > an intruder to inject its own meltdown code, you're quite screwed anyway. >> > >> >> 5. I'm sure the security folks have/want a way to force it on forever >> > >> > Sure! That's what I implemented using the sysctl. >> > >> >> All of these proposals have serious issues. For example, suppose I >> have a setuid program called nopti that works like this: >> >> $ nopti some_program >> >> nopti verifies that some_program is trustworthy and runs it (as the >> real uid of nopti's user) with PTI off. Now we have all the usual >> problems: you can easily break out using ptrace(), for example. And >> LD_PRELOAD gets this wrong. Et. > > Fine, we can drop it on suid binaries as I mentionned. > >> So I think that no-pti mode is a privilege as opposed to a mode per >> se. If you can turn off PTI, then you have the ability to read all of >> kernel memory So maybe we should treat it as such. Add a capability >> CAP_DISABLE_PTI. If you have that capability (globally), then you can >> use the arch_prctl() or regular prctl() or whatever to turn PTI on. >> If you lose the cap, you lose no-pti mode as well. > > I disagree on this, because the only alternative I have is to decide > to keep my process running as root, which is even worse, as root can > much more easily escape from a chroot jail for example, or access > /dev/mem and read all the memory as well. Also tell Linus that he'll > have to build his kernels as root ;-) Not since Linux 4.3 :) You can set CAP_DISABLE_PTI as an "ambient" capability and let it be inherited by all descendents even if unprivileged. This was all very carefully designed so that a program that inherited an ambient capability but tries to run a less-privileged child using pre-4.3 techniques will correctly drop the ambient capability, which is *exactly* what we want for PTI. So I stand by my suggestion. Linus could still do: $ nopti make -j512 and have it work, but trying to ptrace() the make process from outside the nopti process tree (and without doing nopti ptrace) will fail, as expected. (Unless root does the ptrace, again as expected.) > > The arch_prctl() calls I proposed only allow to turn PTI off for > privileged users but any user can turn it back on. For me it's > important. A process might trust itself for its own use, but such > processes will rarely trust external processes in case they need to > perform an occasional fork+exec. Imagine for example a static web > server requiring to achieve the highest possible performance and > having to occasionally call logrotate to rotate+compress the logs. > It's better if the process knows how to turn PTI back on before > calling this. In my proposal, CAP_DISABLE_PTI doesn't turn off PTI -- it just grants the ability to have PTI off. If you have PTI off, you can turn it back in using prctl() or whatever. So you call prctl() (to turn PTI back on) or capset() (to turn it on and drop the ability to turn it off). How exactly do you plan to efficiently call logrotate from your webserver if the flag is per-mm, though? You don't want to fork() in your fancy server because fork() write-protects the whole address space. So you use clone() or vfork() (like posix_spawn() does internally on a sane libc), but now you can't turn PTI back on if it's per-mm because you haven't changed your mm. I really really think it should be per thread. > >> If an LSM wants to block it, it can use existing mechanisms. >> >> As for per-mm vs per-thread, let's make it only switchable in >> single-threaded processes for now and inherited when threads are >> created. > > That's exactly what it does for now, but Linus doesn't like it at all. > So I'll switch it back to per-mm + per-CPU variable. Well he has a valid > point regarding the pgd and _PAGE_NX setting. point Now at least we know > the change is minimal if we have a good reason for doing differently > later. Yuck, I hate this. Are you really going to wire it up complete with all the IPIs needed to get the thing synced up right? it's also going to run slower no matter what, I think, because you'll have to sync the per-mm state back to the TI flags on context switches. Linus, can you explain again why you think PTI should be a per-mm thing? That is, why do you think it's useful and why do you think it makes logical sense from a user's POV? I think the implementation is easier and saner for per-thread. Also, if I we use a capability bit for it, making it per-mm gets really weird. --Andy