Date: Thu, 4 Jan 2018 19:49:10 +0200
From: "Kalle A. Sandstrom"
To: linux-kernel@vger.kernel.org
Subject: RFD: Fastpath amelioration of the KAISER/KPTI performance impact
Message-ID: <20180104174910.GA23675@molukki>

[presented with intent to amuse and edumacate, here's a little something something for the current performance crisis.]

--- cut here ---

Fastpath amelioration of the KAISER fixes' performance impact in Linux.
Kalle A. Sandström, 20180104

[DRAFT VERSION 0: not for publication. not even for serious consideration; v0 should be read as an elaborate joke.]

ABSTRACT. This document identifies an opportunity for clawing back some of the performance penalty from the KAISER/KPTI security patch by means of fast-pathing interprocess communication in the section of code that'd otherwise trampoline kernel entry. Two possible designs to this end are briefly outlined.
The designs presented are for the very worst case where microcode updates don't appear, or are restricted to new CPU models, and consequently KAISER/KPTI is here to stay for a hojillion people. All of this may be a terrible idea. Caveat lector; a good argument can be made in favour of not looking into the abyss.

SYNOPSIS. Increase the footprint of the constant function fragment (CFF) to handle some forms of task switching and inter-process communication without enabling the kernel proper, thereby halving the number of TLB flushes over some IPC roundtrips. The IPC mechanism might be something as involved as a reimplementation of most POSIX I/O, or as minimal as a rendezvous synchronization primitive combined with existing shared memory gubbins.

Distinguish this from the ``big'' kernel with filesystems, MM, block devices, and anything with an infinite memory requirement, which is stashed behind the extra TLB flush. Call the intermediary an ittybittykernel. *rimshot*

While most performance gains from this general approach should happen early on, higher-hanging fruit will be available for a long time to come, so the CFF is expected to grow indefinitely. It could be foreseen that there'll be a long-term game of cat and mouse between the speculative information leak finder and the perpetually-appointed security engineer, providing both with long-term careers in computational esoterica.

This design document presents a speculative development path towards such a Frankenstein's architecture as well as a first step along that path, ultimately motivated by the prospect of recovering some of that putative 20% performance penalty. On the downside, even the best result will still be worse than a hypothetical microkernel system written from scratch, but only until the CPU manufacturers repair their emissions: after that monolithic will rule microbenchmarks once more (on new hardware, and chips where a microcode fix is available & yields a lesser penalty).

BACKGROUND.
The KAISER patch makes the kernel invulnerable to the speculative address space probing feature of certain Intel processors (the ``Meltdown'' vulnerability). It accomplishes this at the cost of a TLB flush on both entry to and exit from every syscall, which brings the minimum number of flushes over the shortest possible inter-process roundtrip to 4. This is a heavy performance cost in applications where out-of-process computation doesn't dominate TLB reload overhead.

It could even be said that in terms of performance, KAISER turns Linux into the worst possible microkernel system: one where exactly no services are provided by the intermediate layer but all of a monolithic design's downsides are retained, leaving the intermediary's introduction a step for the worse from all perspectives besides security.

PROPOSAL. Instead of having the kernel mapped into each process and serving syscalls etc. directly, the KAISER patch changes the kernel interface to an analogue of what's used in 4G/4G mode. That's to say, it forwards kernel entry via a set of IDT and syscall trampolines over the TLB flush boundary into what's effectively a separate kernel address space.

The simplified rationale is that since the region containing the trampolines is small and its contents easily audited for security issues, this prevents both leakage of useful information regarding kernel address space layout randomization (KASLR), and (consequently) the utilization of speculative kernel information leak vulnerabilities without (an)other ASLR leak(s).

The proposal at hand amounts to an increase in the footprint of this ``constant function fragment'' so that communication between the X server and its clients wouldn't suffer double TLB flushes.
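
To put toy numbers on the flush arithmetic from BACKGROUND: the sketch below is a counting model, not a measurement. The function name and both constants are hypothetical. The two flushes per big-kernel syscall come straight from the text above; the single flush per fastpathed transfer is an assumption (one address-space switch from sender to receiver, done inside the CFF).

```python
# Toy accounting model of TLB flushes per IPC roundtrip. All names are
# hypothetical; nothing here touches a real kernel interface.

FLUSHES_VIA_BIG_KERNEL = 2  # one flush entering the kernel, one leaving (per BACKGROUND)
FLUSHES_VIA_CFF = 1         # ASSUMPTION: a single address-space switch, sender to receiver


def roundtrip_flushes(transfers: int, fastpathed: int = 0) -> int:
    """Count TLB flushes over a roundtrip made of `transfers` IPC hops.

    `fastpathed` is how many of those hops the CFF handles by itself;
    the remainder drop into the big kernel as usual.
    """
    if not 0 <= fastpathed <= transfers:
        raise ValueError("fastpathed must be between 0 and transfers")
    slow = transfers - fastpathed
    return slow * FLUSHES_VIA_BIG_KERNEL + fastpathed * FLUSHES_VIA_CFF
```

Under these assumptions `roundtrip_flushes(2)` gives the 4 flushes of the shortest KAISER roundtrip (request plus reply), and `roundtrip_flushes(2, fastpathed=2)` gives 2, which is the halving claimed in the SYNOPSIS.
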
Two distinct means are proposed: the first is a conservative reimplementation of a subset of POSIX file descriptor and process management, and UNIX domain sockets; and the second a simplistic ``shared memory with rendezvous sync'' primitive coupled with fiddly business in the C library and a legacy fallback.

Regardless of design particulars, the additional code's presence is justified by being eventually fully auditable for both KASLR information leaks and exploitable speculative-execution gadgets. That's to say: its security follows from limited scope and legions of hungry twentysomethings over however many years it takes.

DESIGN. [Imagine a convincing argument here about how UNIX domain sockets require much of UNIX to operate properly, and how that path from one process to the next isn't gonna fit in a small enough binary to audit properly. It would be compatible with POSIX I/O though.]

A rendezvous signaling mechanism would identify senders and recipients by a cookie tied to a thread identity and an address space defined by big-kernel memory management, rendering it able to switch processes and drop to the big kernel for scheduling, but little else. Its requirements are: first, that user contexts involved in rendezvous signaling be available to CFF space; second, that the CFF is able to switch between address spaces without dropping to the big kernel; third, that communication security is managed by the big kernel in the form of ahead-of-time checking; and fourth, that a magical unicorn will take care of multiprocessing details. The rest is left as an exercise for the reader.

On the downside, a thin mechanism like that would require reimplementation of UNIX domain sockets in some compatible way, mainly in userspace, and under similar security demands if it's not to be worse in that regard than the alternative.
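
For the sake of having something to point at, the rendezvous idea above can be modelled in a few lines of Python. This is a single-CPU toy under loud assumptions: the class, its method names, the integer cookies and the three return strings are all hypothetical, and the magical unicorn is not modelled at all. It only shows the division of labour: the big kernel vets sender/receiver pairs ahead of time, and the fastpath either hands the payload over or gives up and drops to the big kernel.

```python
from dataclasses import dataclass, field


@dataclass
class Rendezvous:
    """Toy model of a rendezvous fastpath. All names are hypothetical."""

    allowed: set = field(default_factory=set)  # (sender, receiver) pairs vetted ahead of time
    waiting: set = field(default_factory=set)  # receiver cookies parked in receive()
    inbox: dict = field(default_factory=dict)  # receiver cookie -> last delivered payload

    def authorise(self, sender: int, receiver: int) -> None:
        # Stands in for the big kernel's ahead-of-time security check.
        self.allowed.add((sender, receiver))

    def receive(self, receiver: int) -> None:
        # The receiver parks itself until a sender shows up.
        self.waiting.add(receiver)

    def send(self, sender: int, receiver: int, payload: bytes) -> str:
        if (sender, receiver) not in self.allowed:
            return "denied"    # pair was never vetted by the big kernel
        if receiver not in self.waiting:
            return "slowpath"  # partner not ready: drop to the big kernel to schedule
        self.waiting.discard(receiver)
        self.inbox[receiver] = payload  # hand over, then switch to the receiver
        return "delivered"
```

A send to a vetted receiver that is already parked returns "delivered"; anything else either falls back to the slow path or is refused, which is about all a mechanism this thin can do by itself.
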
It's also guaranteed not to interoperate with any select(2)-like syscall due to its separation from POSIX; consequently threads receiving data via that mechanism will only be able to pass it to threads involved in POSIX I/O using heavyweight syscalls. There's also a bunch of interrupt, timeout, lifecycle, etc. semantics that're going unexplored here.

It may be that one of these mechanisms offers more benefit to various pre-flip IPI etc. handlers in the CFF than the other.

IMPLEMENTATION (USER SPACE). Legacy static binaries, and those without a recent C library, will continue to use the syscall interface, dropping into the big kernel as a matter of course. New and compatible binaries will link to a C library which implements some cases of POSIX I/O in terms of the fastpath IPC mechanism by means of unicorn farts & kitten giggles.

IMPLEMENTATION (KERNEL). [Imagine a fancy data structure for storing IPC endpoint cookies and so forth without exposing their contents through a Meltdown gadget.]

[Imagine elaborate whatsits for sharing some pages between the CFF and the big kernel for e.g. user context management, inter-processor interactions, and so forth.]

[Imagine the legwork it takes to get the job done. phew! what a slog.]

TESTING. Just run whatever old bollocks on top, see how it breaks. The usual.

CONCLUSION. Of course I'm having a giggle; effluent just became actual! But for how long?

--- cut here ---

-KS