Date: Fri, 03 Apr 2015 18:12:59 +0200
From: Denys Vlasenko
To: Ingo Molnar
CC: Linus Torvalds, Steven Rostedt, Borislav Petkov, "H. Peter Anvin",
    Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
Message-ID: <551EBC0B.8060406@redhat.com>
In-Reply-To: <20150403140336.GA16422@gmail.com>

On 04/03/2015 04:03 PM, Ingo Molnar wrote:
>
> * Denys Vlasenko wrote:
>
>> Interrupt entry points are handled with the following code.
>> Each 32-byte code block contains seven entry points:
>>
>>    ...
>>    [push][jump 22] // 4 bytes
>>    [push][jump 18] // 4 bytes
>>    [push][jump 14] // 4 bytes
>>    [push][jump 10] // 4 bytes
>>    [push][jump  6] // 4 bytes
>>    [push][jump  2] // 4 bytes
>>    [push][jump common_interrupt][padding] // 8 bytes
>>
>>    [push][jump]
>>    [push][jump]
>>    [push][jump]
>>    [push][jump]
>>    [push][jump]
>>    [push][jump]
>>    [push][jump common_interrupt][padding]
>>
>>    [padding_2]
>>    common_interrupt:
>>
>> The first six jumps are short (2-byte) jumps to the last jump in the
>> block. The last jump usually has to be the longer, 5-byte form.
>>
>> There are about 30 such blocks.
>>
>> This change uses the fact that the last few of these blocks are close
>> to the "common_interrupt" label, so the last jump there can also be a
>> short one.
>>
>> This allows the last several 32-byte code blocks to contain eight
>> entry points instead of seven, and to have all of these entry points
>> jump directly to common_interrupt, eliminating one branch. Such blocks
>> look like this:
>>
>>    [push][jump common_interrupt] // 4 bytes
>>    [push][jump common_interrupt] // 4 bytes
>>    [push][jump common_interrupt] // 4 bytes
>>    [push][jump common_interrupt] // 4 bytes
>>    [push][jump common_interrupt] // 4 bytes
>>    [push][jump common_interrupt] // 4 bytes
>>    [push][jump common_interrupt] // 4 bytes
>>    [push][jump common_interrupt] // 4 bytes
>>
>> This change also replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before
>> the dispatch table and before common_interrupt with 32-byte alignment.
>>
>> The first alignment was completely unnecessary: the dispatch table
>> will not benefit from any alignment larger than 32 bytes (any entry
>> point needs at most 32 bytes of the table to be read by the CPU).
>>
>> The alignment before the common_interrupt label affects the size of
>> the padding (the "[padding_2]" above), and excessive padding reduces
>> the number of blocks which can be optimized. In the .config I tested
>> with, this alignment was 64 bytes, which means two fewer blocks could
>> be optimized. I believe 32-byte alignment should do just fine.
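
(Aside, not part of the quoted patch: the packing arithmetic described
above can be sanity-checked with a small C sketch. The byte counts are
the ones quoted in the description -- 4 bytes per push+short-jump pair,
7 bytes for a push followed by a near jump, 32-byte blocks -- everything
else is purely illustrative.)

    #include <stdio.h>

    /* Sizes as quoted in the description above (illustrative model only). */
    #define PUSH_SHORT_JMP 4   /* 2-byte push + 2-byte short jmp  */
    #define PUSH_NEAR_JMP  7   /* 2-byte push + 5-byte near jmp   */
    #define BLOCK_SIZE     32  /* one cache-line-sized stub block */

    int main(void)
    {
        /* "Classic" block: six push+short-jmp pairs plus one push+near-jmp. */
        int classic = 6 * PUSH_SHORT_JMP + PUSH_NEAR_JMP;   /* 31 bytes */

        /* "Packed" block near common_interrupt: eight push+short-jmp pairs. */
        int packed = 8 * PUSH_SHORT_JMP;                     /* 32 bytes */

        printf("classic block: %d bytes -> 7 stubs, %d byte(s) padding\n",
               classic, BLOCK_SIZE - classic);
        printf("packed block:  %d bytes -> 8 stubs, %d byte(s) padding\n",
               packed, BLOCK_SIZE - packed);
        return 0;
    }

(With those sizes a classic block leaves one byte of padding, while a
packed block fills the 32 bytes exactly, which is where the eighth entry
point comes from.)
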
>>
>> The condition which controls the choice of a direct short jump to
>> common_interrupt versus a jump to the block-local near jump,
>>
>>     /* Can we reach common_interrupt with a short jump? */
>>     .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>>
>> was tested to be a pessimistic one: the generated code will not
>> erroneously contain a longer-than-necessary jump even with the maximum
>> possible padding, and even with -1 subtracted from the right-hand side
>> above. This was verified by checking the disassembly.
>>
>> However, with the current value of FIRST_SYSTEM_VECTOR it so happens
>> that the padding before "common_interrupt" is zero bytes.
>>
>> Disassembly before                                After
>> ==========================================       ==================================
>> ...                                              ...
>> 840 6a9b            push $0xffffffffffffff9b     82c 6a9d            push $0xffffffffffffff9d
>> 842 eb16            jmp 85a                      82e eb30            jmp 860
>> 844 6a9a            push $0xffffffffffffff9a     830 6a9c            push $0xffffffffffffff9c
>> 846 eb12            jmp 85a                      832 eb2c            jmp 860
>> 848 6a99            push $0xffffffffffffff99     834 6a9b            push $0xffffffffffffff9b
>> 84a eb0e            jmp 85a                      836 eb28            jmp 860
>> 84c 6a98            push $0xffffffffffffff98     838 6a9a            push $0xffffffffffffff9a
>> 84e eb0a            jmp 85a                      83a eb24            jmp 860
>> 850 6a97            push $0xffffffffffffff97     83c 6a99            push $0xffffffffffffff99
>> 852 eb06            jmp 85a                      83e eb20            jmp 860
>> 854 6a96            push $0xffffffffffffff96     840 6a98            push $0xffffffffffffff98
>> 856 eb02            jmp 85a                      842 eb1c            jmp 860
>> 858 6a95            push $0xffffffffffffff95     844 6a97            push $0xffffffffffffff97
>> 85a eb24            jmp 880                      846 eb18            jmp 860
>> 85c 0f1f4000        nopl 0x0(%rax)               848 6a96            push $0xffffffffffffff96
>> 860 6a94            push $0xffffffffffffff94     84a eb14            jmp 860
>> 862 eb0c            jmp 870                      84c 6a95            push $0xffffffffffffff95
>> 864 6a93            push $0xffffffffffffff93     84e eb10            jmp 860
>> 866 eb08            jmp 870                      850 6a94            push $0xffffffffffffff94
>> 868 6a92            push $0xffffffffffffff92     852 eb0c            jmp 860
>> 86a eb04            jmp 870                      854 6a93            push $0xffffffffffffff93
>> 86c 6a91            push $0xffffffffffffff91     856 eb08            jmp 860
>> 86e eb00            jmp 870                      858 6a92            push $0xffffffffffffff92
>> 870 eb0e            jmp 880                      85a eb04            jmp 860
>> 872 66666666662e0f  data32 data32 data32 data32  85c 6a91            push $0xffffffffffffff91
>> 879 1f840000000000  nopw %cs:0x0(%rax,%rax,1)    85e eb00            jmp 860
>> 880 <common_interrupt>:                          860 <common_interrupt>:
>>
>> Sizes:
>>    text    data     bss     dec     hex filename
>>   12578       0       0   12578    3122 entry_64.o.before
>>   12546       0       0   12546    3102 entry_64.o
>>
>> Signed-off-by: Denys Vlasenko
>> CC: Linus Torvalds
>> CC: Steven Rostedt
>> CC: Ingo Molnar
>> CC: Borislav Petkov
>> CC: "H. Peter Anvin"
>> CC: Andy Lutomirski
>> CC: Oleg Nesterov
>> CC: Frederic Weisbecker
>> CC: Alexei Starovoitov
>> CC: Will Drewry
>> CC: Kees Cook
>> CC: x86@kernel.org
>> CC: linux-kernel@vger.kernel.org
>> ---
>>  arch/x86/kernel/entry_64.S | 61 +++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 49 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index da137f9..2003417 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -537,33 +537,70 @@ END(ret_from_fork)
>>   * Build the entry stubs and pointer table with some assembler magic.
>>   * We pack 7 stubs into a single 32-byte chunk, which will fit in a
>>   * single cache line on all modern x86 implementations.
>> + * The last few cachelines pack 8 stubs each.
>>   */
>> +ALIGN_common_interrupt=32
>>  	.section .init.rodata,"a"
>>  ENTRY(interrupt)
>>  	.section .entry.text
>> -	.p2align 5
>> -	.p2align CONFIG_X86_L1_CACHE_SHIFT
>> +	.align 32
>>  ENTRY(irq_entries_start)
>>  	INTR_FRAME
>>  vector=FIRST_EXTERNAL_VECTOR
>> -.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7
>> -	.balign 32
>> +.rept 256 /* this number does not need to be exact, just big enough */
>> +
>> +	/*
>> +	 * Block of six "push + short_jump" pairs, 4 bytes each,
>> +	 * and a 2-byte seventh "push", without its jump yet.
>> +	 */
>> +need_near_jump=0
>> +push_count=0
>>    .rept 7
>>      .if vector < FIRST_SYSTEM_VECTOR
>> -      .if vector <> FIRST_EXTERNAL_VECTOR
>> +1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
>> +      .previous
>> +	.quad 1b
>> +      .section .entry.text
>> +vector=vector+1
>> +push_count=push_count+1
>> +      .if push_count < 7
>> +	/* Can we reach common_interrupt with a short jump? */
>> +	.if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>> +	jmp common_interrupt /* yes */
>> +	.else
>> +	jmp 2f
>> +need_near_jump=1
>> +	.endif
>>  	CFI_ADJUST_CFA_OFFSET -8
>>        .endif
>> +      .endif
>> +  .endr
>> +
>> +  /* If we are close to the end, we can pack in yet another pair */
>> +  .if need_near_jump == 0
>> +    .if push_count == 7
>> +	/* The "short jump" for the seventh "push" */
>> +	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>> +    .endif
>> +    .if vector < FIRST_SYSTEM_VECTOR
>> +	/* "push + short_jump" pair #8 */
>>  1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
>> -      .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
>> -	jmp 2f
>> -      .endif
>> -      .previous
>> +    .previous
>>  	.quad 1b
>> -      .section .entry.text
>> +    .section .entry.text
>>  vector=vector+1
>> +push_count=push_count+1
>> +	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>>      .endif
>> -  .endr
>> +  .else
>> +	/* Use remaining space for "near jump" for the seventh "push" */
>>  2:	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>> +	.align 2
>> +  .endif
>> +
>>  .endr
>>  	CFI_ENDPROC
>>  END(irq_entries_start)
>
> So I think this might be easier to read if we went low tech and used
> an open-coded 240+ lines assembly file for this, with a few
> obvious-purpose helper macros?

I tried a few times but the results were monstrous.
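
(Closing aside, not part of the thread above: the short-jump
reachability condition quoted in the patch description can be modelled
in a few lines of C. This is only an illustrative sketch; the
FIRST_SYSTEM_VECTOR value below is an assumed stand-in chosen for the
example, not the kernel's actual definition, while the 4-bytes-per-stub
and 127-byte short-jump-reach figures come from the description above.)

    #include <stdio.h>

    /*
     * Illustrative stand-in values: FIRST_SYSTEM_VECTOR here is an
     * assumed number for the sake of the example, not taken from the
     * kernel headers.
     */
    #define FIRST_SYSTEM_VECTOR     239
    #define ALIGN_COMMON_INTERRUPT   32

    /*
     * Mirror of the quoted .if: a stub may short-jump straight to
     * common_interrupt only if, even with maximal alignment padding
     * before the label, the target stays within the +127-byte reach of
     * a 2-byte jmp (each remaining vector costs about 4 bytes of stub).
     */
    static int can_use_short_jump(int vector)
    {
        return vector >= (FIRST_SYSTEM_VECTOR -
                          (128 - ALIGN_COMMON_INTERRUPT) / 4);
    }

    int main(void)
    {
        int v, first = -1;

        for (v = 0; v < FIRST_SYSTEM_VECTOR; v++) {
            if (can_use_short_jump(v)) {
                first = v;
                break;
            }
        }

        printf("first vector allowed to short-jump to common_interrupt: %d\n",
               first);
        printf("vectors covered by the direct-jump optimization: %d\n",
               FIRST_SYSTEM_VECTOR - first);
        return 0;
    }

(With these stand-in numbers the cutoff works out to (128-32)/4 = 24
vectors before FIRST_SYSTEM_VECTOR, i.e. roughly the last three 32-byte
blocks, consistent with the "last few of these blocks" wording in the
description.)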