Date: Fri, 3 Apr 2015 16:03:36 +0200
From: Ingo Molnar
To: Denys Vlasenko
Cc: Linus Torvalds, Steven Rostedt, Borislav Petkov, "H. Peter Anvin",
    Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook,
    x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
Message-ID: <20150403140336.GA16422@gmail.com>
References: <1428059634-11782-1-git-send-email-dvlasenk@redhat.com>
In-Reply-To: <1428059634-11782-1-git-send-email-dvlasenk@redhat.com>
User-Agent: Mutt/1.5.23 (2014-03-12)


* Denys Vlasenko wrote:

> Interrupt entry points are handled with the following code.
> Each 32-byte code block contains seven entry points:
>
>   ...
>   [push][jump 22] // 4 bytes
>   [push][jump 18] // 4 bytes
>   [push][jump 14] // 4 bytes
>   [push][jump 10] // 4 bytes
>   [push][jump  6] // 4 bytes
>   [push][jump  2] // 4 bytes
>   [push][jump common_interrupt][padding] // 8 bytes
>
>   [push][jump]
>   [push][jump]
>   [push][jump]
>   [push][jump]
>   [push][jump]
>   [push][jump]
>   [push][jump common_interrupt][padding]
>
>   [padding_2]
>   common_interrupt:
>
> The first six jumps are short (2-byte) jumps to the last jump in the
> block. The last jump usually has to be the longer, 5-byte form.
>
> There are about 30 such blocks.
>
> This change uses the fact that the last few of these blocks are close
> to the "common_interrupt" label, so the last jump there can also be a
> short one.
>
> This allows the last several 32-byte code blocks to contain eight
> entry points instead of seven, and to have all of these entry points
> jump directly to common_interrupt, eliminating one branch. They look
> like this:
>
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>
> This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before the
> dispatch table and before common_interrupt with 32-byte alignment.
>
> The first alignment was completely unnecessary: the dispatch table
> does not benefit from any alignment bigger than 32 bytes (any entry
> point needs at most 32 bytes of the table to be read by the CPU).
>
> The alignment before the common_interrupt label affects the size of
> the padding (the "[padding_2]" above), and excessive padding reduces
> the number of blocks which can be optimized. In the .config I tested
> with, this alignment was 64 bytes, which means two fewer blocks could
> be optimized. I believe 32-byte alignment should do just fine.
>
> The condition which controls whether a stub jumps to common_interrupt
> directly or via the trailing near jump,
>
>   /* Can we reach common_interrupt with a short jump? */
>   .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>
> was tested to be pessimistic: even with the maximum possible padding,
> and even with -1 subtracted from the right hand side above (i.e. with
> one more stub allowed to use the direct jump), the generated code does
> not erroneously end up with a longer-than-necessary jump.
> This was verified by checking the disassembly.
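
(As an illustrative aside, not something from the patch: the bound can
also be sanity-checked at assemble time. The throwaway snippet below is
only a back-of-the-envelope check, not the exact bookkeeping the .rept
loop does; the FIRST_SYSTEM_VECTOR value is the assumed v4.0-era
constant and the file name is made up.)

        /* check.S: throwaway, assemble with "as -o /dev/null check.S" */
        FIRST_SYSTEM_VECTOR    = 0xef   /* assumed v4.0-era value */
        ALIGN_common_interrupt = 32

        /* Smallest vector the condition lets jump straight to common_interrupt: */
        first_direct_vector = FIRST_SYSTEM_VECTOR - (128 - ALIGN_common_interrupt)/4

        /*
         * Worst-case distance from that stub's jmp to common_interrupt:
         * 4 bytes for every later stub, plus at most
         * ALIGN_common_interrupt-1 bytes of padding before the label.
         * A rel8 jmp reaches +127 bytes.
         */
        stubs_after = FIRST_SYSTEM_VECTOR - 1 - first_direct_vector
        worst_case  = stubs_after*4 + ALIGN_common_interrupt - 1
        .if worst_case > 127
        .error "a short jmp could not reach common_interrupt"
        .endif

With those values the worst case works out to 23*4 + 31 = 123 bytes,
and even allowing one more stub (the "-1 subtracted from the right hand
side" case) gives 127, which still fits, consistent with the claim that
the condition is pessimistic.
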
> However, with the current value of FIRST_SYSTEM_VECTOR it so happens
> that the padding before "common_interrupt" is zero bytes.
>
> Disassembly before                                After
> ================================================  ====================================
> ...                                               ...
> 840 6a9b  push $0xffffffffffffff9b                82c 6a9d  push $0xffffffffffffff9d
> 842 eb16  jmp 85a                                 82e eb30  jmp 860
> 844 6a9a  push $0xffffffffffffff9a                830 6a9c  push $0xffffffffffffff9c
> 846 eb12  jmp 85a                                 832 eb2c  jmp 860
> 848 6a99  push $0xffffffffffffff99                834 6a9b  push $0xffffffffffffff9b
> 84a eb0e  jmp 85a                                 836 eb28  jmp 860
> 84c 6a98  push $0xffffffffffffff98                838 6a9a  push $0xffffffffffffff9a
> 84e eb0a  jmp 85a                                 83a eb24  jmp 860
> 850 6a97  push $0xffffffffffffff97                83c 6a99  push $0xffffffffffffff99
> 852 eb06  jmp 85a                                 83e eb20  jmp 860
> 854 6a96  push $0xffffffffffffff96                840 6a98  push $0xffffffffffffff98
> 856 eb02  jmp 85a                                 842 eb1c  jmp 860
> 858 6a95  push $0xffffffffffffff95                844 6a97  push $0xffffffffffffff97
> 85a eb24  jmp 880                                 846 eb18  jmp 860
> 85c 0f1f4000  nopl 0x0(%rax)                      848 6a96  push $0xffffffffffffff96
> 860 6a94  push $0xffffffffffffff94                84a eb14  jmp 860
> 862 eb0c  jmp 870                                 84c 6a95  push $0xffffffffffffff95
> 864 6a93  push $0xffffffffffffff93                84e eb10  jmp 860
> 866 eb08  jmp 870                                 850 6a94  push $0xffffffffffffff94
> 868 6a92  push $0xffffffffffffff92                852 eb0c  jmp 860
> 86a eb04  jmp 870                                 854 6a93  push $0xffffffffffffff93
> 86c 6a91  push $0xffffffffffffff91                856 eb08  jmp 860
> 86e eb00  jmp 870                                 858 6a92  push $0xffffffffffffff92
> 870 eb0e  jmp 880                                 85a eb04  jmp 860
> 872 66666666662e0f  data32 data32 data32 data32   85c 6a91  push $0xffffffffffffff91
> 879 1f840000000000  nopw %cs:0x0(%rax,%rax,1)     85e eb00  jmp 860
> 880 <common_interrupt>:                           860 <common_interrupt>:
>
> Sizes:
>
>    text    data     bss     dec     hex filename
>   12578       0       0   12578    3122 entry_64.o.before
>   12546       0       0   12546    3102 entry_64.o
>
> Signed-off-by: Denys Vlasenko
> CC: Linus Torvalds
> CC: Steven Rostedt
> CC: Ingo Molnar
> CC: Borislav Petkov
> CC: "H. Peter Anvin"
> CC: Andy Lutomirski
> CC: Oleg Nesterov
> CC: Frederic Weisbecker
> CC: Alexei Starovoitov
> CC: Will Drewry
> CC: Kees Cook
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/entry_64.S | 61 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 49 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index da137f9..2003417 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -537,33 +537,70 @@ END(ret_from_fork)
>   * Build the entry stubs and pointer table with some assembler magic.
>   * We pack 7 stubs into a single 32-byte chunk, which will fit in a
>   * single cache line on all modern x86 implementations.
> + * The last few cachelines pack 8 stubs each.
>   */
> +ALIGN_common_interrupt=32
>  	.section .init.rodata,"a"
>  ENTRY(interrupt)
>  	.section .entry.text
> -	.p2align 5
> -	.p2align CONFIG_X86_L1_CACHE_SHIFT
> +	.align 32
>  ENTRY(irq_entries_start)
>  	INTR_FRAME
>  vector=FIRST_EXTERNAL_VECTOR
> -.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7
> -	.balign 32
> +.rept 256 /* this number does not need to be exact, just big enough */
> +
> +	/*
> +	 * Block of six "push + short_jump" pairs, 4 bytes each,
> +	 * and a 2-byte seventh "push", without its jump yet.
> +	 */
> +need_near_jump=0
> +push_count=0
>  	.rept 7
>  	.if vector < FIRST_SYSTEM_VECTOR
> -	.if vector <> FIRST_EXTERNAL_VECTOR
> +1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
> +	.previous
> +	.quad 1b
> +	.section .entry.text
> +vector=vector+1
> +push_count=push_count+1
> +	.if push_count < 7
> +	/* Can we reach common_interrupt with a short jump? */
> +	.if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
> +	jmp common_interrupt /* yes */
> +	.else
> +	jmp 2f
> +need_near_jump=1
> +	.endif
>  	CFI_ADJUST_CFA_OFFSET -8
>  	.endif
> +	.endif
> +	.endr
> +
> +	/* If we are close to the end, we can pack in yet another pair */
> +	.if need_near_jump == 0
> +	.if push_count == 7
> +	/* The "short jump" for the seventh "push" */
> +	jmp common_interrupt
> +	CFI_ADJUST_CFA_OFFSET -8
> +	.endif
> +	.if vector < FIRST_SYSTEM_VECTOR
> +	/* "push + short_jump" pair #8 */
>  1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
> -	.if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
> -	jmp 2f
> -	.endif
> -	.previous
> +	.previous
>  	.quad 1b
> -	.section .entry.text
> +	.section .entry.text
>  vector=vector+1
> +push_count=push_count+1
> +	jmp common_interrupt
> +	CFI_ADJUST_CFA_OFFSET -8
>  	.endif
> -	.endr
> +	.else
> +	/* Use remaining space for "near jump" for the seventh "push" */
>  2:	jmp common_interrupt
> +	CFI_ADJUST_CFA_OFFSET -8
> +	.align 2
> +	.endif
> +
>  .endr
>  	CFI_ENDPROC
>  END(irq_entries_start)
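
A side note on the "$(~vector+0x80)" lines in the hunk above, since the
"always in signed byte range" comment is what keeps every push at 2
bytes: ~v + 0x80 is the same value as 0x7f - v, so across the external
vectors the pushed immediate runs from +0x5f down to -0x6f and always
fits in an imm8. The snippet below only illustrates that property; it is
another assemble-time throwaway with the v4.0-era vector bounds assumed,
not anything taken from the patch:

        FIRST_EXTERNAL_VECTOR = 0x20    /* assumed v4.0-era values */
        FIRST_SYSTEM_VECTOR   = 0xef

        /*
         * ~v + 0x80 == 0x7f - v: the largest value occurs at the first
         * external vector, the smallest at the last one; both must fit
         * in a signed byte for the assembler to pick the 2-byte
         * "push imm8" form.
         */
        .if (0x7f - FIRST_EXTERNAL_VECTOR) > 127
        .error "push immediate would not fit in a signed byte"
        .endif
        .if (0x7f - (FIRST_SYSTEM_VECTOR - 1)) < -128
        .error "push immediate would not fit in a signed byte"
        .endif

(If I read the common_interrupt side right, it simply undoes the 0x80
bias again before dispatching, so only the stub side has to care about
the encoding.)
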
So I think this might be easier to read if we went low tech and used an
open-coded, 240+ line assembly file for this, with a few obvious-purpose
helper macros? It's not like it will change all that often, so it only
has to be generated once.

Thanks,

	Ingo
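
For what it's worth, one possible shape of that "low tech" variant,
purely as a sketch: the macro name is made up, CFI annotations and the
kernel's ENTRY/END wrappers are left out, and the long per-vector list
is exactly the part that would be generated once:

                .macro  intr_stub vector
        1:      pushq   $(~\vector + 0x80)      /* same signed-byte trick as today */
                jmp     common_interrupt
                .section .init.rodata, "a"
                .quad   1b                      /* matching dispatch-table entry */
                .previous
                .endm

                .section .init.rodata, "a"
        interrupt:                              /* the pointer table */

                .section .entry.text, "ax"
                .align  32
        irq_entries_start:
                intr_stub 0x20
                intr_stub 0x21
                intr_stub 0x22
                /* ... one generated line per external vector, 0x20 through 0xee ... */

        common_interrupt:                       /* placeholder so the sketch assembles standalone */
                ret

Assembling that standalone with plain "as" and disassembling the object
should be enough to see the 4-byte stub packing without any .rept
machinery.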