Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1756639AbeAIM5l (ORCPT + 1 other);
        Tue, 9 Jan 2018 07:57:41 -0500
Received: from wtarreau.pck.nerim.net ([62.212.114.60]:39114 "EHLO 1wt.eu"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1753136AbeAIM5j (ORCPT );
        Tue, 9 Jan 2018 07:57:39 -0500
From: Willy Tarreau
To: linux-kernel@vger.kernel.org, x86@kernel.org
Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
        Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
        Thomas Gleixner, Josh Poimboeuf, "H. Peter Anvin", Kees Cook
Subject: [RFC PATCH v2 0/6] Per process PTI activation
Date: Tue, 9 Jan 2018 13:56:14 +0100
Message-Id: <1515502580-12261-1-git-send-email-w@1wt.eu>
X-Mailer: git-send-email 2.8.0.rc2.1.gbe9624a
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Return-Path:

So here comes the second version after the first round of comments.

As suggested, I dropped the thread_info flag and placed it in the
mm_struct instead. There's now a per_cpu variable that can be checked in
the entry code to decide whether or not to switch CR3.

It's important to note that the new flag is lost upon execve(). I think
that this provides a better guarantee against any accidental use (e.g. a
program calling some external helpers once in a while), but it also means
we can't use a wrapper anymore and have to modify the executable. I
continue to think that a mixed approach, consisting of a specific flag
that is only applied upon the next execve() call and then dropped, could
be nice, but for now I'm not really sure how to do this cleanly.

Regarding the _PAGE_NX change, for now I didn't touch it. I like Andy's
approach of changing it dynamically after the first page fault caused by
the return to userspace. I just don't know how to do that for now.

I've split the entry code changes in two.
The first part only updates the kernel entry code to avoid updating CR3
if it already points to a kernel PGD. The second one adds the flag check
when going back to userspace. This allowed me to check whether the
CR3-only changes brought any benefit, but I failed to detect any
improvement with that alone for now, including on a preempt kernel.

With this patch set, when haproxy starts with "arch_prctl(0x1022, 1)",
the performance drop compared to booting with "pti=off" is only ~1% and
more or less within measurement noise.

For now I've left the prctl to retrieve the current value as it helped
during debugging, though I think it should disappear before the final
version as it provides very little value.

Here are the numbers I'm seeing in the various situations for a few tests
on a physical machine (Core i7-4790K). Numbers are in connections per
second, with the performance ratio compared to pti=off in parentheses:

  TEST(*)            reject       reject+acl       forward
  ---------------+-------------+---------------+----------------
  pti=off          444k (100%)   252k (100%)     83k (100%)
  pti=on           382k  (86%)   195k  (77%)     71k  (85%)
  pti=on+prctl     439k  (99%)   249k  (99%)     83k (100%)

  *: tests:
     "reject"     : reject rule, accept(), setsockopt() and close()
     "reject+acl" : acl-based rule, does extra syscalls (getsockname(),
                    getsockopt, 2 setsockopt, recv, shutdown)
     "forward"    : connection forwarded to remote server, much heavier

It's interesting to note that the rule employing a few more syscalls
without adding much userspace work is obviously more impacted by PTI. We
have a total of 8 syscalls per connection on the middle one and the
difference is significant.

Willy

Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Dave Hansen
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Josh Poimboeuf
Cc: "H. Peter Anvin"
Cc: Kees Cook

Willy Tarreau (6):
  x86/mm: add a pti_disable entry in mm_context_t
  x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable
    PTI
  x86/pti: add a per-cpu variable pti_disable
  x86/pti: don't mark the user PGD with _PAGE_NX.
  x86/entry/pti: avoid setting CR3 when it's already correct
  x86/entry/pti: don't switch PGD on when pti_disable is set

 arch/x86/entry/calling.h          | 25 +++++++++++++++++++++++++
 arch/x86/include/asm/mmu.h        |  4 ++++
 arch/x86/include/uapi/asm/prctl.h |  3 +++
 arch/x86/kernel/process_64.c      | 24 ++++++++++++++++++++++++
 arch/x86/mm/pti.c                 |  2 ++
 5 files changed, 58 insertions(+)

-- 
1.7.12.1
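
For reference, the "arch_prctl(0x1022, 1)" call mentioned in the letter
could be issued from a program roughly as sketched below. This is only an
illustration: the value 0x1022 and the argument 1 come from the letter,
the ARCH_SET_NOPTI name comes from the patch title, and pairing that name
with that value is an assumption. The helper name set_nopti() and the
restriction of the argument to 0 or 1 are hypothetical. A kernel that
does not carry this RFC series is expected to refuse the request.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value quoted in the cover letter. The name is borrowed from the patch
 * title; that this name maps to this value is an assumption. */
#define ARCH_SET_NOPTI_ASSUMED 0x1022

/*
 * Hypothetical helper: ask the kernel to disable PTI (disable = 1) or
 * re-enable it (disable = 0) for the calling process.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_nopti(unsigned long disable)
{
    /* The letter only shows the value 1; this sketch accepts 0 or 1. */
    assert(disable <= 1);
#ifdef SYS_arch_prctl
    return syscall(SYS_arch_prctl, ARCH_SET_NOPTI_ASSUMED, disable) == 0 ? 0 : -1;
#else
    /* arch_prctl() is x86-specific; report "not implemented" elsewhere. */
    errno = ENOSYS;
    return -1;
#endif
}
```

Since the letter states that the flag is lost upon execve(), such a call
would have to be made by the target executable itself rather than by a
wrapper that later executes it.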
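
The two entry-code changes described above can also be modelled in plain
userspace C to make the intended behaviour easier to follow. This is a
toy model, not the assembly from arch/x86/entry/calling.h: the struct,
the function names and the CR3 base value are all hypothetical, and the
use of bit 12 of CR3 to tell the user PGD from the kernel PGD is an
assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: user and kernel PGDs differ by a single CR3 bit (bit 12). */
#define USER_PGD_BIT (UINT64_C(1) << 12)

struct cpu_model {
    uint64_t cr3;         /* modelled CR3 value */
    bool     pti_disable; /* stands in for the per-cpu variable */
    unsigned cr3_writes;  /* number of CR3 rewrites performed */
};

static void write_cr3(struct cpu_model *cpu, uint64_t val)
{
    cpu->cr3 = val;
    cpu->cr3_writes++;
}

/* Kernel entry: rewrite CR3 only if it still points to the user PGD. */
static void enter_kernel(struct cpu_model *cpu)
{
    if (cpu->cr3 & USER_PGD_BIT)
        write_cr3(cpu, cpu->cr3 & ~USER_PGD_BIT);
}

/* Return to userspace: stay on the kernel PGD when pti_disable is set. */
static void exit_to_user(struct cpu_model *cpu)
{
    if (!cpu->pti_disable)
        write_cr3(cpu, cpu->cr3 | USER_PGD_BIT);
}

/* Run n syscall round trips starting from userspace and return how many
 * times CR3 was rewritten along the way. */
static unsigned simulate_syscalls(bool pti_disable, unsigned n)
{
    struct cpu_model cpu = { .cr3 = UINT64_C(0x1000000),
                             .pti_disable = pti_disable,
                             .cr3_writes = 0 };

    assert((cpu.cr3 & USER_PGD_BIT) == 0); /* base is the kernel PGD */
    exit_to_user(&cpu);                    /* put the task in userspace */
    cpu.cr3_writes = 0;                    /* count only the round trips */

    for (unsigned i = 0; i < n; i++) {
        enter_kernel(&cpu);
        exit_to_user(&cpu);
    }
    return cpu.cr3_writes;
}
```

In this model, simulate_syscalls(false, 8) performs 16 CR3 writes for the
8 syscalls per connection of the "reject+acl" test, while
simulate_syscalls(true, 8) performs none. The model says nothing about
the actual cost of each write.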