From: Andy Lutomirski
Date: Tue, 25 Aug 2015 09:28:57 -0700
Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup
To: Brian Gerst
Cc: X86 ML, Denys Vlasenko, Borislav Petkov, Linus Torvalds,
    "linux-kernel@vger.kernel.org", Jan Beulich

On Tue, Aug 25, 2015 at 3:59 AM, Brian Gerst wrote:
> On Mon, Aug 24, 2015 at 5:13 PM, Andy Lutomirski wrote:
>>
>> We could also annotate which syscalls need full regs and jump to the
>> slow path for them.  This would leave the fast path unchanged (we
>> could duplicate the syscall table so that regs-requiring syscalls
>> would turn into some asm that switches to the slow path).  We'd make
>> the syscall table say something like:
>>
>> 59    64    execve    sys_execve:regs
>>
>> The fast path would have exactly identical performance and the slow
>> path would presumably speed up.  The downside would be additional
>> complexity.
>
> I don't think it is worth it to optimize the syscalls that need full
> pt_regs (which are generally quite expensive and less frequently used)
> at the expense of every other syscall.
>
> What kind of cleanups, other than just removing the stubs, would this
> allow?  Is there more code you plan to move to C?

This isn't about optimizing the regs-using syscalls at all -- it's
about simplifying all the other ones and optimizing the slow path.

The way that the regs-using syscalls currently work is that the entry
in the syscall table expects to see rbx, rbp, and r12-r15 in
*registers*, and it shoves them into pt_regs and pulls them back out.
This means that we pretty much have to call syscalls from asm, which
precludes the straightforward re-implementation of the whole slow
path as:

void do_slow_syscall(...)
{
	enter_from_user_mode();
	fixup_arg5 [if compat fast syscall];
	seccomp, etc;
	if (nr < max)
		call the syscall;
	exit tracing;
	prepare_return_to_usermode();
}

I bet that, with a bit of tweaking, that would actually end up faster
than what we do right now for everything except fully fast-path
syscalls.

This would also be a *huge* sanity improvement for the compat case,
in which the args are currently jumbled in asm.  It would become:

	if (nr < max)
		call the syscall(regs->bx, regs->cx, regs->dx, ...);

which completely avoids the unreadable and probably buggy mess we
have now.  We could just get rid of the compat fast path entirely --
I would be a bit surprised if anyone cared about a couple of cycles
for compat, but I don't think it's a great idea long-term to have the
compat path fully written in C but the native 64-bit path partially
in asm.
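To make that concrete, the native 64-bit version could end up looking
roughly like this.  (Very much a sketch, not working code:
do_slow_syscall_64, slow_sys_call_table, and the entry/exit helpers
are invented names standing in for the pieces in the outline above.)

/* Sketch only: the helper and table names are made up for illustration. */
void do_slow_syscall_64(struct pt_regs *regs)
{
	unsigned long nr = regs->orig_ax;

	enter_from_user_mode();                  /* as in the outline above */
	nr = do_syscall_entry_work(nr, regs);    /* seccomp, ptrace, audit, etc. */

	if (nr < NR_syscalls)
		regs->ax = slow_sys_call_table[nr](regs->di, regs->si,
						   regs->dx, regs->r10,
						   regs->r8, regs->r9);
	else
		regs->ax = -ENOSYS;

	do_syscall_exit_work(regs);              /* exit tracing */
	prepare_return_to_usermode(regs);
}

The exact split of the entry/exit work is up for grabs; the point is
that once nothing expects callee-saved registers in register form,
everything after the asm prologue can read its arguments out of
pt_regs and write the return value back to regs->ax.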
My concrete idea here is to have two 64-bit syscall tables: fast and
slow.  The slow table would point to the real C functions for all
syscalls.  The fast table would be the same except for the syscalls
that use regs; for those syscalls it would point to:

GLOBAL(stub_switch_to_slow_path_64)
	popq	%r11		/* discard return address */
	movq	%rbp, RBP(%rsp), etc;
	jmp	entry_SYSCALL_64_slow_path
END(stub_switch_to_slow_path_64)

so that the regs-using syscalls take the slow path no matter what.
This doesn't even require autogenerated stubs, since they can all
share the same stub.

Now the 64-bit fast path can stay more or less the same (we'd reorder
the first flags test and the subq $(6*8), %rsp), and the slow path
can be almost all in C.  Then I can back out the two-phase entry
tracing thing, and after *that*, muahaha, I can dust off some
languishing seccomp improvements I have that are incompatible with
two-phase entry tracing.

(I have a half-written test case to exercise the dark corners of
syscall args and tracing.  So far it catches a bug in SYSCALL32 that
was apparently never fixed (which makes me wonder why signal-heavy
workloads work on AMD systems in compat mode), but I haven't extended
it enough to catch the R9 thing.)

>
>> Thing 2: vdso compilation with binutils that doesn't support .cfi
>> directives
>>
>> Userspace debuggers really like having the vdso properly
>> CFI-annotated, and the 32-bit fast syscall entries are annotated
>> manually in hexadecimal.  AFAIK Jan Beulich is the only person who
>> understands it.
>>
>> I want to be able to change the entries a little bit to clean them
>> up (and possibly rework the SYSCALL32 and SYSENTER register tricks,
>> which currently suck), but it's really, really messy right now
>> because of the hex CFI stuff.  Could we just drop the CFI
>> annotations if the binutils version is too old, or even just require
>> new enough binutils to build 32-bit and compat kernels?
>
> One thing I want to do is rework the 32-bit VDSO into a single image,
> using alternatives to handle the selection of entry method.  The
> open-coded CFI crap has made that near impossible to do.

Yes please!  But please don't change the actual instruction ordering
at all yet, since the SYSCALL case seems to be buggy right now.

(If you want to be really fancy, don't use alternatives.  Instead,
teach vdso2c to annotate the actual dynamic table function pointers
so we can rewrite the pointers at boot time.  That will save a cycle
or two.)

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC