Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752862AbbBWVIW (ORCPT ); Mon, 23 Feb 2015 16:08:22 -0500 Received: from mx1.redhat.com ([209.132.183.28]:41195 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752158AbbBWVIU (ORCPT ); Mon, 23 Feb 2015 16:08:20 -0500 Message-ID: <54EB96AB.5050808@redhat.com> Date: Mon, 23 Feb 2015 22:07:55 +0100 From: Denys Vlasenko User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Andy Lutomirski , Borislav Petkov , "H. Peter Anvin" CC: X86 ML , LKML , Linus Torvalds Subject: RFC: (almost) getting rid of FIXUP/RESTORE_TOP_OF_STACK? Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6799 Lines: 183 Hi Andy, Peter, Borislav, Since Andy just took my big patch (whew), further simplifications in entry64.S are possible. I want to discuss one particular change I have in mind. The main 64-bit syscall entry point to kernel is used when userspace executes SYSCALL insn. When that happens, we load %rsp so that there is a space for iret stack on top (5 stack slots), but we don't fully populate it, we only use %rip slot there (pt_regs->rip). Instead of pt_regs->rsp, we save user's %rsp in PER_CPU(old_rsp). Current logic is as follows: SYSCALL insn saves %rip to %rcx, %rflags to %r11. We save %rcx to pt_regs->rip, but save %r11 to its "native" place in pt_regs->r11. On syscall exit, we load pt_regs->rip to %rcx, pt_regs->r11 to %r11, and execute SYSRET. So far so good. Now, the complications. There are cases where we may need to return via IRET insn. This requires full, valid iret stack. In particular, pt_regs->r11 needs to be copied to pt_regs->rflags. Also, we have ptrace poking into pt_regs. It can change pt_regs->rflags, pt_regs->rsp. These changes need to be copied to pt_regs->r11, PER_CPU(old_rsp). Also, ptrace can look at pt_regs->cs/ss, they need to be set to __USER_CS/__USER_DS. Both of these cases are handled by FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK macros, which are ugly: .macro FIXUP_TOP_OF_STACK tmp offset=0 movq PER_CPU_VAR(old_rsp),\tmp movq \tmp,RSP+\offset(%rsp) movq $__USER_DS,SS+\offset(%rsp) movq $__USER_CS,CS+\offset(%rsp) movq RIP+\offset(%rsp),\tmp /* get rip */ movq \tmp,RCX+\offset(%rsp) /* copy it to rcx as sysret would do */ movq R11+\offset(%rsp),\tmp /* get eflags */ movq \tmp,EFLAGS+\offset(%rsp) .endm .macro RESTORE_TOP_OF_STACK tmp offset=0 movq RSP+\offset(%rsp),\tmp movq \tmp,PER_CPU_VAR(old_rsp) movq EFLAGS+\offset(%rsp),\tmp movq \tmp,R11+\offset(%rsp) .endm I propose to change (simplify?) it. - First step: save %r11 in pt_regs->rflags (because it really is user's %rflags). With this change, FIXUP/RESTORE macros do not need to copy it to and fro. stub_iopl is not needed (it exists now because it touches pt_regs->rflags). Question: what to do with pt_regs->r11 slot? It is accessible via ptrace, should we zero it out? Save %r11 twice? Don't populate it, and make FIXUP_TOP_OF_STACK do pt_regs->r11 = pt_regs->rflags copy? (We do the latter for rcx now, i.e. FIXUP_TOP_OF_STACK does pt_regs->rcx = pt_regs->rip). - Second: save user's %rsp in pt_regs->rsp, not PER_CPU(old_rsp). In fact, PER_CPU(old_rsp) can't be completely eliminated, but it can be relegated to a purely temporary storage on syscall entry path. All references to it from C code will be gone, thread.usersp field is also gone, a few ugly places are becoming less ugly, they consistently use pt_regs->sp now. Example: unsigned long KSTK_ESP(struct task_struct *task) { - return (test_tsk_thread_flag(task, TIF_IA32)) ? - (task_pt_regs(task)->sp) : ((task)->thread.usersp); + return task_pt_regs(task)->sp; } In entry64.S, this change boils down to this: movq %rsp,PER_CPU_VAR(old_rsp) movq PER_CPU_VAR(kernel_stack),%rsp ... movq $-ENOSYS,RAX(%rsp) movq_cfi rax,(ORIG_RAX) movq %rcx,RIP(%rsp) + movq PER_CPU_VAR(old_rsp),%rcx + movq %rcx,RSP(%rsp) ... ...call syscall_table[i]... ... - movq PER_CPU_VAR(old_rsp), %rsp + movq RSP(%rsp), %rsp USERGS_SYSRET64 FIXUP/RESTORE_TOP_OF_STACK will no longer need to move ->rsp value back and forth. Is this simplification worth the penalty of having two more MOV insns? My testing shows that syscall fast path is ~54.3 ns before and after the patch (on 2.7 GHz Sandy Bridge CPU). - With these two changes, RESTORE_TOP_OF_STACK becomes a no-op (both pt_regs->rsp and pt_regs->rflags are in their rightful places now). FIXUP_TOP_OF_STACK will be reduced to: .macro FIXUP_TOP_OF_STACK tmp offset=0 movq $__USER_DS,SS+\offset(%rsp) movq $__USER_CS,CS+\offset(%rsp) movq RIP+\offset(%rsp),\tmp /* get rip */ movq \tmp,RCX+\offset(%rsp) /* copy it to rcx as sysret would do */ movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */ movq \tmp,R11+\offset(%rsp) .endm iow, it now doesn't play around with pt_regs->rsp, and r11<->rflags move is in the opposite direction. - Next possible change is to use PUSH insns to build the stack. Something along the lines of swapgs mov %rsp,%gs:old_rsp mov %gs:kernel_stack,%rsp sti push $__USER_DS /* pt_regs->ss */ push %gs:old_rsp /* ->rsp */ push %r11 /* ->rflags */ push $__USER_CS /* ->cs */ push %rcx /* ->rip */ push %rax /* ->orig_rax */ push %rdi push %rsi push %rdx push %rcx push $-ENOSYS /* ->rax */ push %r8 push %r9 push %r10 push %r11 sub $(6*8),%rsp /* rbx, rbp, r12-r15 not saved */ This populates pt_regs->ss/cs, making FIXUP_TOP_OF_STACK even less complex. Also, pushes are 1 or 2-byte insns. For comparison, current code looks like this in disassembly: 120: 0f 01 f8 swapgs 123: 65 48 89 24 25 00 00 mov %rsp,%gs:old_rsp 12a: 00 00 12c: 65 48 8b 24 25 00 00 mov %gs:kernel_stack,%rsp 133: 00 00 135: fb sti 136: 48 81 ec 80 00 00 00 sub $0x80,%rsp 13d: 4c 89 5c 24 30 mov %r11,0x30(%rsp) 142: 4c 89 54 24 38 mov %r10,0x38(%rsp) 147: 4c 89 4c 24 40 mov %r9,0x40(%rsp) 14c: 4c 89 44 24 48 mov %r8,0x48(%rsp) 151: 48 89 54 24 60 mov %rdx,0x60(%rsp) 156: 48 89 74 24 68 mov %rsi,0x68(%rsp) 15b: 48 89 7c 24 70 mov %rdi,0x70(%rsp) 160: 48 c7 44 24 50 da ff movq $-ENOSYS,0x50(%rsp) 167: ff ff 169: 48 89 44 24 78 mov %rax,0x78(%rsp) 16e: 48 89 8c 24 80 00 00 mov %rcx,0x80(%rsp) As you see, "MOV const,(mem64)" is a ten-byte insn. "PUSH const" is 2 bytes. Other MOVs are biggish too. However, such prologue performs four more stores that our current one (->cs/ss, and extra pushes for rcx/r11). Is it a show-stopper? Thoughts? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/