Date: Fri, 10 Apr 2015 14:08:46 +0200
From: Ingo Molnar <mingo@kernel.org>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
        Jason Low <jason.low2@hp.com>, Peter Zijlstra <peterz@infradead.org>,
        Davidlohr Bueso <dave@stgolabs.net>,
        Tim Chen <tim.c.chen@linux.intel.com>,
        Aswin Chandramouleeswaran <aswin@hp.com>,
        LKML <linux-kernel@vger.kernel.org>, Borislav Petkov <bp@alien8.de>,
        Andy Lutomirski <luto@amacapital.net>,
        Denys Vlasenko <dvlasenk@redhat.com>, Brian Gerst <brgerst@gmail.com>,
        "H. Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: [PATCH] x86: Align jump targets to 1 byte boundaries
Message-ID: <20150410120846.GA17101@gmail.com>
References: <CA+55aFz6KKxGVxPAbsmw9GsKJfy85P2C0EmYBrGpn+aJDjZJWw@mail.gmail.com>
 <20150409175652.GI6464@linux.vnet.ibm.com>
 <CA+55aFzXMDjQQ7jTjsPdh1RikXfgV7OCd-+13cz06MOmDBA33w@mail.gmail.com>
 <CA+55aFwZWi6ecDmVsMBQJTrgrW3GD2DaRtpiOspe=5amR1=dNg@mail.gmail.com>
 <20150409183926.GM6464@linux.vnet.ibm.com>
 <20150410090051.GA28549@gmail.com>
 <20150410091252.GA27630@gmail.com>
 <20150410092152.GA21332@gmail.com>
 <20150410111427.GA30477@gmail.com>
 <20150410112748.GB30477@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150410112748.GB30477@gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4130
Lines: 104


* Ingo Molnar <mingo@kernel.org> wrote:

> So restructure the loop a bit, to get much tighter code:
> 
> 0000000000000030 <mutex_spin_on_owner.isra.5>:
>   30:	55                   	push   %rbp
>   31:	65 48 8b 14 25 00 00 	mov    %gs:0x0,%rdx
>   38:	00 00
>   3a:	48 89 e5             	mov    %rsp,%rbp
>   3d:	48 39 37             	cmp    %rsi,(%rdi)
>   40:	75 1e                	jne    60 <mutex_spin_on_owner.isra.5+0x30>
>   42:	8b 46 28             	mov    0x28(%rsi),%eax
>   45:	85 c0                	test   %eax,%eax
>   47:	74 0d                	je     56 <mutex_spin_on_owner.isra.5+0x26>
>   49:	f3 90                	pause
>   4b:	48 8b 82 10 c0 ff ff 	mov    -0x3ff0(%rdx),%rax
>   52:	a8 08                	test   $0x8,%al
>   54:	74 e7                	je     3d <mutex_spin_on_owner.isra.5+0xd>
>   56:	31 c0                	xor    %eax,%eax
>   58:	5d                   	pop    %rbp
>   59:	c3                   	retq
>   5a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
>   60:	b8 01 00 00 00       	mov    $0x1,%eax
>   65:	5d                   	pop    %rbp
>   66:	c3                   	retq

Btw., totally off topic, the following NOP caught my attention:

>   5a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

That's a dead NOP that bloats the function a bit, added for the 16-byte
alignment of one of the jump targets.
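
The padding arithmetic behind that NOP can be sketched in a few lines. 
This is only an illustration: the helper name pad_to_align() is made up 
here, and the addresses are the ones from the disassembly quoted above.

```python
def pad_to_align(addr: int, align: int) -> int:
    """Bytes of NOP padding needed so that addr lands on an align-byte boundary."""
    return (-addr) % align

# The hot path ends with the retq at 0x59, so the next free byte is 0x5a.
next_free = 0x5a

# With 16-byte jump target alignment the target moves up to 0x60,
# which is what the 6-byte nopw in the listing fills in.
print(pad_to_align(next_free, 16))  # 6

# With 1-byte alignment (i.e. none) no padding is needed.
print(pad_to_align(next_free, 1))   # 0
```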

I realize that x86 CPU manufacturers recommend 16-byte jump target 
alignments (it's in the Intel optimization manual), but the cost of 
that is very significant:

        text       data       bss        dec   filename
    12566391    1617840   1089536   15273767   vmlinux.align.16-byte
    12224951    1617840   1089536   14932327   vmlinux.align.1-byte

By using 1-byte jump target alignment (i.e. no alignment at all) we
get a roughly 2.7% reduction in kernel text size (!) - and probably a
similar reduction in I$ footprint.
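
For reference, the percentages can be recomputed from the size table 
above. This is a minimal sketch: reduction_pct() is a name invented for 
this example, and the numbers are copied verbatim from the table.

```python
def reduction_pct(before: int, after: int) -> float:
    """Relative size reduction, in percent of the 'before' value."""
    return 100.0 * (before - after) / before

# 'text' and 'dec' columns for the 16-byte and 1-byte aligned builds.
text_16, text_1 = 12566391, 12224951
dec_16, dec_1 = 15273767, 14932327

saved = text_16 - text_1
print(f"text: {saved} bytes saved, {reduction_pct(text_16, text_1):.2f}%")
print(f"dec:  {dec_16 - dec_1} bytes saved, {reduction_pct(dec_16, dec_1):.2f}%")
```

data and bss are identical in both builds, so the whole difference comes 
from text: about 2.7% of text, or about 2.2% of the total image.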

So I'm wondering, is the 16 byte jump target optimization suggestion 
really worth this price? The patch below boots fine and I've not 
measured any noticeable slowdown, but I've not tried hard.

Now, the usual justification for jump target alignment is the
following: with 16-byte instruction-cache cacheline sizes, if a
forward jump target is aligned to a cacheline boundary then prefetches
will start from a new cacheline.

But I think that argument is flawed for typical optimized kernel code 
flows: forward jumps often go to 'cold' (uncommon) pieces of code, and 
aligning cold code to cache lines does not bring a lot of advantages 
(they are uncommon), while it causes collateral damage:

 - their alignment 'spreads out' the cache footprint, it shifts 
   followup hot code further out

 - plus it slows down even 'cold' code that immediately follows 'hot' 
   code (like in the above case), which could have benefited from the 
   partial cacheline that comes off the end of hot code.
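
The second point can be made concrete with the function quoted above. 
This is a rough sketch under the same 16-byte cacheline assumption as 
the argument; bytes_sharing_hot_line() is a hypothetical helper, not 
anything from the kernel tree.

```python
LINE = 16  # assumed cacheline size, taken from the argument above

def bytes_sharing_hot_line(hot_last: int, cold_start: int, cold_len: int,
                           line: int = LINE) -> int:
    """Bytes of the cold block that sit in the cacheline holding the last hot byte."""
    line_start = hot_last - hot_last % line
    line_end = line_start + line
    lo = max(cold_start, line_start)
    hi = min(cold_start + cold_len, line_end)
    return max(0, hi - lo)

HOT_LAST = 0x59   # the retq that ends the hot path
COLD_LEN = 7      # mov $0x1,%eax (5) + pop %rbp (1) + retq (1)

# Aligned to 16 bytes: the cold block starts at 0x60 and shares nothing.
print(bytes_sharing_hot_line(HOT_LAST, 0x60, COLD_LEN))  # 0

# Unaligned: the cold block starts at 0x5a, right after the hot code.
print(bytes_sharing_hot_line(HOT_LAST, 0x5a, COLD_LEN))  # 6
```

In the unaligned layout 6 of the 7 cold bytes come from the cacheline 
that the hot path has already pulled in.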

What do you guys think about this? I think we should seriously 
consider relaxing our alignment defaults.

Thanks,

	Ingo

==================================>
>From 5b83a095e1abdfee5c710c34a5785232ce74f939 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Fri, 10 Apr 2015 13:50:05 +0200
Subject: [PATCH] x86: Align jump targets to 1 byte boundaries

Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..0366d6b44a14 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,6 +77,9 @@ else
         KBUILD_AFLAGS += -m64
         KBUILD_CFLAGS += -m64
 
+        # Align jump targets to 1 byte, not the default 16 bytes:
+        KBUILD_CFLAGS += -falign-jumps=1
+
         # Don't autogenerate traditional x87 instructions
         KBUILD_CFLAGS += $(call cc-option,-mno-80387)
         KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)