From: Denys Vlasenko
Date: Thu, 18 Jun 2015 12:59:07 +0200
To: Ingo Molnar
Cc: Linus Torvalds, Steven Rostedt, Borislav Petkov, "H. Peter Anvin",
 Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
 Will Drewry, Kees Cook, x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/5] x86/asm/entry/32: Replace RESTORE_RSI_RDI[_RDX] with
 open-coded 32-bit reads

On 06/18/2015 11:31 AM, Ingo Molnar wrote:
>> If it takes, say, 20 cycles to pull data from e.g. L3 cache to ECX,
>> then SYSRET can't possibly complete sooner than in 20 cycles.
>
> Yeah, that's true, but my point is: SYSRET has to do a lot of other things
> (permission checks, loading the user mode state - most of which are unrelated
> to R11/RCX), which take dozens of cycles,

SYSRET was designed to avoid doing that. It does not check permissions - it
slam-dunks CPL3 and resets CS and SS to preset values.
It does not touch the stack register or restore any other GP registers.

Having said that, I'd try to get cold hard facts, i.e. experimentally measure
SYSRET latency.

> and which are probably overlapped with any
> cache misses on arguments such as R11/RCX.
>
> It's not impossible that reordering helps, for example if SYSRET has some
> internal dependencies that make its parallelism worse than ideal - but I'd
> complicate this code only if it gives a measurable improvement for cache-cold
> syscall performance.

I attempted to test it. With the patch which moves the RCX and R11 loads
all the way down:

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index f2064bd..0ea09a3 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -139,9 +139,6 @@ sysexit_from_sys_call:
 	 * with 'sysenter' and it uses the SYSENTER calling convention.
 	 */
 	andl	$~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-	/* Prepare registers for SYSRET insn */
-	movl	RIP(%rsp), %ecx		/* User %eip */
-	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	/* Restore registers per SYSEXIT ABI requirements: */
 	/* arg1 (ebx): preserved by virtue of being a callee-saved register */
 	/* arg2 (ecx): used by SYSEXIT to restore esp (and by SYSRET to restore eip) */
@@ -155,6 +152,9 @@ sysexit_from_sys_call:
 	xorl	%r8d, %r8d
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
+	/* Prepare registers for SYSRET insn */
+	movl	RIP(%rsp), %ecx		/* User %eip */
+	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	TRACE_IRQS_ON

 	/*
@@ -374,9 +374,6 @@ cstar_dispatch:

 sysretl_from_sys_call:
 	andl	$~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-	/* Prepare registers for SYSRET insn */
-	movl	RIP(%rsp), %ecx		/* User %eip */
-	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	/* Restore registers per SYSRET ABI requirements: */
 	/* arg1 (ebx): preserved by virtue of being a callee-saved register */
 	/* arg2 (ebp): preserved (already restored, see above) */
@@ -388,6 +385,9 @@ sysretl_from_sys_call:
 	xorl	%r8d, %r8d
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
+	/* Prepare registers for SYSRET insn */
+	movl	RIP(%rsp), %ecx		/* User %eip */
+	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	TRACE_IRQS_ON
 	movl	RSP(%rsp), %esp

This does not change instruction sizes, and therefore does not change code
cacheline alignments over the entire bzImage.

Testing getpid() in a loop (IOW: a cache-hot test) did show that with this
patch it is slower, but by a statistically insignificant amount:

before patch, it's 61.92 ns per syscall.
after patch, it's 61.99 ns per syscall.

That's less than one cycle - more like 0.15 cycles. However, it is
reproducible.

I did not figure out how to do a cache-cold test. I tried a 65kbyte-ish read
from /dev/zero. That takes ~3885 ns, and its variability of +-10 ns drowns
out any possible difference.