Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753400AbbBYM1M (ORCPT );
        Wed, 25 Feb 2015 07:27:12 -0500
Received: from bombadil.infradead.org ([198.137.202.9]:46552 "EHLO
        bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1753332AbbBYM1K (ORCPT );
        Wed, 25 Feb 2015 07:27:10 -0500
Date: Wed, 25 Feb 2015 13:27:01 +0100
From: Peter Zijlstra
To: Andi Kleen
Cc: Andi Kleen, x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/3] x86: Move msr accesses out of line
Message-ID: <20150225122701.GK5029@twins.programming.kicks-ass.net>
References: <1424482737-958-1-git-send-email-andi@firstfloor.org>
        <20150223170436.GC5029@twins.programming.kicks-ass.net>
        <20150223174340.GD27767@tassilo.jf.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150223174340.GD27767@tassilo.jf.intel.com>
User-Agent: Mutt/1.5.21 (2012-12-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 5452
Lines: 140

On Mon, Feb 23, 2015 at 09:43:40AM -0800, Andi Kleen wrote:
> On Mon, Feb 23, 2015 at 06:04:36PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 20, 2015 at 05:38:55PM -0800, Andi Kleen wrote:
> >
> > > This patch moves the MSR functions out of line. A MSR access is typically
> > > 40-100 cycles or even slower, a call is a few cycles at best, so the
> > > additional function call is not really significant.
> >
> > If I look at the below PDF a CALL+PUSH EBP+MOV RSP,RBP+ ... +POP+RET
> > ends up being 5+1.5+0.5+ .. + 1.5+8 = 16.5 + .. cycles.
>
> You cannot just add up the latency cycles. The CPU runs all of this
> in parallel.
>
> Latency cycles would only be interesting if these instructions were
> on the critical path for computing the result, which they are not.
>
> It should be a few cycles overhead.

I thought that since CALL touches RSP, PUSH touches RSP, MOV RSP,
(obviously) touches RSP, POP touches RSP and well, RET does too. There
were strong dependencies on the instructions and there would be little
room to parallelize things.

I'm glad you so patiently educated me on the wonders of modern
architectures and how they can indeed do all this in parallel.

Still, I wondered, so I ran me a little test. Note that I used a
serializing instruction (LOCK XCHG) because WRMSR is too.

I see a ~14 cycle difference between the inline and noinline version.

If I substitute the LOCK XCHG with XADD, I get to 1.5 cycles in
difference, so clearly there is some magic happening, but serializing
instructions wreck it.

Anybody can explain how such RSP deps get magiced away?

---

root@ivb-ep:~# cat call.c

#define __always_inline         inline __attribute__((always_inline))
#define noinline                __attribute__((noinline))

static int
#ifdef FOO
noinline
#else
__always_inline
#endif
xchg(int *ptr, int val)
{
        asm volatile ("LOCK xchgl %0, %1\n"
                        : "+r" (val), "+m" (*(ptr))
                        : : "memory", "cc");
        return val;
}

void main(void)
{
        int val = 0, old;

        for (int i = 0; i < 1000000000; i++)
                old = xchg(&val, i);
}

root@ivb-ep:~# gcc -std=gnu99 -O3 -fno-omit-frame-pointer -DFOO -o call call.c
root@ivb-ep:~# objdump -D call | awk '/<[^>]*>:/ {p=0} /<main>:/ {p=1} /<xchg>:/ {p=1} { if (p) print $0 }'
00000000004003e0 <main>:
  4003e0:       55                      push   %rbp
  4003e1:       48 89 e5                mov    %rsp,%rbp
  4003e4:       53                      push   %rbx
  4003e5:       31 db                   xor    %ebx,%ebx
  4003e7:       48 83 ec 18             sub    $0x18,%rsp
  4003eb:       c7 45 e0 00 00 00 00    movl   $0x0,-0x20(%rbp)
  4003f2:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  4003f8:       48 8d 7d e0             lea    -0x20(%rbp),%rdi
  4003fc:       89 de                   mov    %ebx,%esi
  4003fe:       83 c3 01                add    $0x1,%ebx
  400401:       e8 fa 00 00 00          callq  400500 <xchg>
  400406:       81 fb 00 ca 9a 3b       cmp    $0x3b9aca00,%ebx
  40040c:       75 ea                   jne    4003f8 <main+0x18>
  40040e:       48 83 c4 18             add    $0x18,%rsp
  400412:       5b                      pop    %rbx
  400413:       5d                      pop    %rbp
  400414:       c3                      retq

0000000000400500 <xchg>:
  400500:       55                      push   %rbp
  400501:       89 f0                   mov    %esi,%eax
  400503:       48 89 e5                mov    %rsp,%rbp
  400506:       f0 87 07                lock xchg %eax,(%rdi)
  400509:       5d                      pop    %rbp
  40050a:       c3                      retq
  40050b:       90                      nop
  40050c:       90                      nop
  40050d:       90                      nop
  40050e:       90                      nop
  40050f:       90                      nop

root@ivb-ep:~# gcc -std=gnu99 -O3 -fno-omit-frame-pointer -o call-inline call.c
root@ivb-ep:~# objdump -D call-inline | awk '/<[^>]*>:/ {p=0} /<main>:/ {p=1} /<xchg>:/ {p=1} { if (p) print $0 }'
00000000004003e0 <main>:
  4003e0:       55                      push   %rbp
  4003e1:       31 c0                   xor    %eax,%eax
  4003e3:       48 89 e5                mov    %rsp,%rbp
  4003e6:       c7 45 f0 00 00 00 00    movl   $0x0,-0x10(%rbp)
  4003ed:       0f 1f 00                nopl   (%rax)
  4003f0:       89 c2                   mov    %eax,%edx
  4003f2:       f0 87 55 f0             lock xchg %edx,-0x10(%rbp)
  4003f6:       83 c0 01                add    $0x1,%eax
  4003f9:       3d 00 ca 9a 3b          cmp    $0x3b9aca00,%eax
  4003fe:       75 f0                   jne    4003f0 <main+0x10>
  400400:       5d                      pop    %rbp
  400401:       c3                      retq

root@ivb-ep:~# perf stat -e "cycles:u" ./call

 Performance counter stats for './call':

    36,309,274,162 cycles:u

      10.561819310 seconds time elapsed

root@ivb-ep:~# perf stat -e "cycles:u" ./call-inline

 Performance counter stats for './call-inline':

    22,004,045,745 cycles:u

       6.498271508 seconds time elapsed