Message-ID: <5527E3E9.7010608@redhat.com>
Date: Fri, 10 Apr 2015 16:53:29 +0200
From: Denys Vlasenko
To: Borislav Petkov
CC: Ingo Molnar, "Paul E. McKenney", Linus Torvalds, Jason Low,
    Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran,
    LKML, Andy Lutomirski, Brian Gerst, "H. Peter Anvin", Thomas Gleixner,
    Peter Zijlstra
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries
References: <20150409183926.GM6464@linux.vnet.ibm.com>
    <20150410090051.GA28549@gmail.com> <20150410091252.GA27630@gmail.com>
    <20150410092152.GA21332@gmail.com> <20150410111427.GA30477@gmail.com>
    <20150410112748.GB30477@gmail.com> <20150410120846.GA17101@gmail.com>
    <20150410131929.GE28074@pd.tnic> <5527D631.4090905@redhat.com>
    <20150410140141.GI28074@pd.tnic>
In-Reply-To: <20150410140141.GI28074@pd.tnic>

On 04/10/2015 04:01 PM, Borislav Petkov wrote:
> On Fri, Apr 10, 2015 at 03:54:57PM +0200, Denys Vlasenko wrote:
>> On 04/10/2015 03:19 PM, Borislav Petkov wrote:
>>> On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote:
>>>> Now, the usual justification for jump target alignment is the
>>>> following: with 16 byte instruction-cache cacheline sizes, if a
>>>
>>> You mean 64 bytes?
>>>
>>> Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably
>>> some branch predictor stride thing.
>>
>> IIRC it's a maximum decode bandwidth. Decoders on the most powerful
>> x86 CPUs, both Intel and AMD, attempt to decode in one cycle
>> up to four instructions. For this they fetch up to 16 bytes.
>
> 32 bytes fetch window per cycle for AMD F15h and F16h, see my other
> mail. And Intel probably do the same.

There are people who experimentally researched this.
According to this guy:

http://www.agner.org/optimize/microarchitecture.pdf

Intel CPUs can decode only up to 16 bytes at a time
(but they have loop buffers, and some have a uop cache,
which can skip decoding entirely).
AMD CPUs can decode 21 bytes at best. With two cores active,
only 16 bytes.

"""
10 Haswell pipeline
...
10.1 Pipeline
The pipeline is similar to previous designs, but improved with more of
everything. It is designed for a throughput of four instructions per
clock cycle. Each core has a reorder buffer with 192 entries, the
reservation station has 60 entries, and the register file has 168
integer registers and 168 vector registers, according to the literature
listed on page 145 below.
All parts of the pipeline are shared between two threads in those CPU
models that can run two threads in each core. Each thread gets half of
the total throughput when two threads are running in the same core.

10.2 Instruction fetch and decoding
The instruction fetch unit can fetch a maximum of 16 bytes of code per
clock cycle in single threaded applications.
There are four decoders, which can handle instructions generating up to
four μops per clock cycle in the way described on page 120 for Sandy
Bridge. Instructions with any number of prefixes are decoded in a single
clock cycle. There is no penalty for redundant prefixes.
...
...
15 AMD Bulldozer, Piledriver and Steamroller pipeline
15.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller
...
15.2 Instruction fetch
The instruction fetcher is shared between the two cores of an execution
unit.
The instruction fetcher can fetch 32 aligned bytes of code per clock
cycle from the level-1 code cache. The measured fetch rate was up to 16
bytes per clock per core when two cores were active, and up to 21 bytes
per clock in linear code when only one core was active. The fetch rate
is lower than these maximum values when instructions are misaligned.
Critical subroutine entries and loop entries should not start near the
end of a 32-bytes block. You may align critical entries by 16 or at
least make sure there is no 16-bytes boundary in the first four
instructions after a critical label.
"""
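
To make the last piece of that advice concrete, here is a rough sketch
in C of the arithmetic involved: how much padding moves a label up to
the next 16-byte boundary, and whether the first few instructions after
a label straddle a fetch-window boundary. The function names
(pad_to_align, crosses_boundary) and the instruction lengths below are
made up for illustration; nothing here is taken from the kernel or from
the patch under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of padding needed to move addr up to the
 * next multiple of align. align must be a power of two.
 */
static size_t pad_to_align(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size_t)((0 - addr) & (uintptr_t)(align - 1));
}

/*
 * Hypothetical helper: returns 1 if the first n instructions starting
 * at addr span a window-byte boundary, 0 otherwise. insn_len[i] is the
 * encoded length in bytes of instruction i.
 */
static int crosses_boundary(uintptr_t addr, const unsigned char *insn_len,
                            size_t n, size_t window)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += insn_len[i];
    if (total == 0)
        return 0;

    /* Compare the window index of the first and the last byte. */
    return (addr / window) != ((addr + total - 1) / window);
}
```

For example, a label at offset 0x1d followed by instructions of 3, 5, 2
and 4 bytes occupies bytes 0x1d..0x2a and therefore straddles the
boundary at 0x20; three bytes of padding would move the label to 0x20,
where the same four instructions fit inside a single 16-byte window.
Whether avoiding that crossing is worth the padding on a given CPU is
the question this thread is arguing about, so treat this as arithmetic
only, not as a recommendation.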