Date: Fri, 10 Apr 2015 15:46:07 +0200
From: Borislav Petkov
To: Ingo Molnar
Cc: "Paul E. McKenney", Linus Torvalds, Jason Low, Peter Zijlstra,
	Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML,
	Andy Lutomirski, Denys Vlasenko, Brian Gerst, "H. Peter Anvin",
	Thomas Gleixner, Peter Zijlstra
Subject: Re: [PATCH] x86: Pack loops tightly as well
Message-ID: <20150410134607.GF28074@pd.tnic>
In-Reply-To: <20150410123017.GB19918@gmail.com>

On Fri, Apr 10, 2015 at 02:30:18PM +0200, Ingo Molnar wrote:
> And the final patch below also packs loops tightly:
>
>      text    data     bss      dec  filename
>  12566391 1617840 1089536 15273767  vmlinux.align.16-byte
>  12224951 1617840 1089536 14932327  vmlinux.align.1-byte
>  11976567 1617840 1089536 14683943  vmlinux.align.1-byte.funcs-1-byte
>  11903735 1617840 1089536 14611111  vmlinux.align.1-byte.funcs-1-byte.loops-1-byte
>
> The total reduction is 5.5%.
>
> Now loop alignment is beneficial if:
>
>  - a loop is cache-hot and its surroundings are not.
>
> Loop alignment is harmful if:
>
>  - a loop is cache-cold
>  - a loop's surroundings are cache-hot as well
>  - two cache-hot loops are close to each other
>
> and I'd argue that the latter three harmful scenarios are much more
> common in the kernel. Similar arguments can be made for function
> alignment as well. (Jump target alignment is a bit different but I
> think the same conclusion holds.)

So IMHO the loop alignment is coupled to the fetch window size and
alignment. I'm looking at the AMD opt. manuals and both for fam 0x15 and
0x16 say that hot loops should be 32-byte aligned due to the 32-byte
aligned fetch window in each cycle.

So if we have hot loops, we probably want them 32-byte aligned (I don't
know what that number is on Intel, need to look).

Family 0x16 says, in addition, that if you have branches in those loops,
the first two branches in a cacheline can be processed in a cycle when
they're in the branch predictor. And so to guarantee that, you should
align your loop start to a cacheline.

And this all depends on the uarch, so I can imagine optimizing for one
would harm the other. Looks like a long project of experimenting and
running perf counters :-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
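
To make the fetch-window argument above concrete, here is a minimal sketch of the arithmetic. It counts how many aligned fetch windows a loop body touches and how much padding a given alignment costs. The 32-byte window is the figure quoted above for AMD fam 0x15/0x16; the 64-byte cacheline size, the function names, the loop size and the start address are all assumptions made up for illustration. It is a simplified model, not a description of how any real front end behaves.

```python
def fetch_windows_spanned(start: int, size: int, window: int = 32) -> int:
    """Count the aligned fetch windows touched by the bytes [start, start + size).

    `window` defaults to 32 bytes, the fetch window size cited in the mail;
    other microarchitectures may differ.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    first = start // window
    last = (start + size - 1) // window
    return last - first + 1


def padding_to_align(start: int, alignment: int) -> int:
    """Bytes of padding needed to move `start` up to the next aligned address."""
    return (-start) % alignment


if __name__ == "__main__":
    # Hypothetical 40-byte loop body at a made-up address.
    start, size = 0x1019, 40
    # 64 stands in for a cacheline here; that size is an assumption.
    for align in (1, 16, 32, 64):
        pad = padding_to_align(start, align)
        windows = fetch_windows_spanned(start + pad, size)
        print(f"align={align:2d} padding={pad:2d} fetch_windows={windows}")
```

The padding column is the text-size cost that the table in the quoted mail measures in aggregate; the window count is a rough proxy for the fetch cost of a hot loop. Whether that trade pays off still has to be measured per uarch with perf counters.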