Date: Fri, 3 Apr 2015 16:03:36 +0200
From: Ingo Molnar
To: Denys Vlasenko
Cc: Linus Torvalds, Steven Rostedt, Borislav Petkov, "H. Peter Anvin",
    Andy Lutomirski, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook,
    x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter
Message-ID: <20150403140336.GA16422@gmail.com>
References: <1428059634-11782-1-git-send-email-dvlasenk@redhat.com>
In-Reply-To: <1428059634-11782-1-git-send-email-dvlasenk@redhat.com>
User-Agent: Mutt/1.5.23 (2014-03-12)


* Denys Vlasenko wrote:

> Interrupt entry points are handled with the following code.
> Each 32-byte code block contains seven entry points:
>
>   ...
>   [push][jump 22] // 4 bytes
>   [push][jump 18] // 4 bytes
>   [push][jump 14] // 4 bytes
>   [push][jump 10] // 4 bytes
>   [push][jump  6] // 4 bytes
>   [push][jump  2] // 4 bytes
>   [push][jump common_interrupt][padding] // 8 bytes
>
>   [push][jump]
>   [push][jump]
>   [push][jump]
>   [push][jump]
>   [push][jump]
>   [push][jump]
>   [push][jump common_interrupt][padding]
>
>   [padding_2]
>   common_interrupt:
>
> The first six jumps are short (2-byte) jumps to the last jump in the
> block. The last jump usually has to be the longer, 5-byte form.
>
> There are about 30 such blocks.
>
> This change uses the fact that the last few of these blocks are close
> to the "common_interrupt" label, so the last jump there can also be a
> short one.
>
> This allows the last several 32-byte code blocks to contain eight
> entry points instead of seven, and to have all of these entry points
> jump directly to common_interrupt, eliminating one branch. They look
> like this:
>
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>   [push][jump common_interrupt] // 4 bytes
>
> This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before the
> dispatch table and before common_interrupt with 32-byte alignment.
>
> The first alignment was completely unnecessary: the dispatch table
> does not benefit from any alignment bigger than 32 bytes (any entry
> point needs at most 32 bytes of the table to be read by the CPU).
>
> The alignment before the common_interrupt label affects the size of
> the padding (the "[padding_2]" above), and excessive padding reduces
> the number of blocks which can be optimized. In the .config I tested
> with, this alignment was 64 bytes, which means two fewer blocks could
> be optimized. I believe 32-byte alignment should do just fine.
>
> The condition which controls whether a stub jumps to common_interrupt
> directly or via the trailing near jump,
>
>   /* Can we reach common_interrupt with a short jump? */
>   .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>
> was tested to be pessimistic: even with the maximum possible padding,
> and even with -1 subtracted from the right hand side above (i.e. with
> one more stub allowed to use the direct jump), the generated code does
> not erroneously end up with a longer-than-necessary jump.
> This was verified by checking the disassembly.
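
(As an illustrative aside, not something from the patch: the bound can
also be sanity-checked at assemble time. The throwaway snippet below is
only a back-of-the-envelope check, not the exact bookkeeping the .rept
loop does; the FIRST_SYSTEM_VECTOR value is the assumed v4.0-era
constant and the file name is made up.)

        /* check.S: throwaway, assemble with "as -o /dev/null check.S" */
        FIRST_SYSTEM_VECTOR    = 0xef   /* assumed v4.0-era value */
        ALIGN_common_interrupt = 32

        /* Smallest vector the condition lets jump straight to common_interrupt: */
        first_direct_vector = FIRST_SYSTEM_VECTOR - (128 - ALIGN_common_interrupt)/4

        /*
         * Worst-case distance from that stub's jmp to common_interrupt:
         * 4 bytes for every later stub, plus at most
         * ALIGN_common_interrupt-1 bytes of padding before the label.
         * A rel8 jmp reaches +127 bytes.
         */
        stubs_after = FIRST_SYSTEM_VECTOR - 1 - first_direct_vector
        worst_case  = stubs_after*4 + ALIGN_common_interrupt - 1
        .if worst_case > 127
        .error "a short jmp could not reach common_interrupt"
        .endif

With those values the worst case works out to 23*4 + 31 = 123 bytes,
and even allowing one more stub (the "-1 subtracted from the right hand
side" case) gives 127, which still fits, consistent with the claim that
the condition is pessimistic.
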
> However, with the current value of FIRST_SYSTEM_VECTOR it so happens
> that the padding before "common_interrupt" is zero bytes.
>
> Disassembly before                                After
> ================================================  ====================================
> ...                                               ...
> 840 6a9b  push $0xffffffffffffff9b                82c 6a9d  push $0xffffffffffffff9d
> 842 eb16  jmp 85a                                 82e eb30  jmp 860
> 844 6a9a  push $0xffffffffffffff9a                830 6a9c  push $0xffffffffffffff9c
> 846 eb12  jmp 85a                                 832 eb2c  jmp 860
> 848 6a99  push $0xffffffffffffff99                834 6a9b  push $0xffffffffffffff9b
> 84a eb0e  jmp 85a                                 836 eb28  jmp 860
> 84c 6a98  push $0xffffffffffffff98                838 6a9a  push $0xffffffffffffff9a
> 84e eb0a  jmp 85a                                 83a eb24  jmp 860
> 850 6a97  push $0xffffffffffffff97                83c 6a99  push $0xffffffffffffff99
> 852 eb06  jmp 85a                                 83e eb20  jmp 860
> 854 6a96  push $0xffffffffffffff96                840 6a98  push $0xffffffffffffff98
> 856 eb02  jmp 85a                                 842 eb1c  jmp 860
> 858 6a95  push $0xffffffffffffff95                844 6a97  push $0xffffffffffffff97
> 85a eb24  jmp 880                                 846 eb18  jmp 860
> 85c 0f1f4000  nopl 0x0(%rax)                      848 6a96  push $0xffffffffffffff96
> 860 6a94  push $0xffffffffffffff94                84a eb14  jmp 860
> 862 eb0c  jmp 870                                 84c 6a95  push $0xffffffffffffff95
> 864 6a93  push $0xffffffffffffff93                84e eb10  jmp 860
> 866 eb08  jmp 870                                 850 6a94  push $0xffffffffffffff94
> 868 6a92  push $0xffffffffffffff92                852 eb0c  jmp 860
> 86a eb04  jmp 870                                 854 6a93  push $0xffffffffffffff93
> 86c 6a91  push $0xffffffffffffff91                856 eb08  jmp 860
> 86e eb00  jmp 870                                 858 6a92  push $0xffffffffffffff92
> 870 eb0e  jmp 880                                 85a eb04  jmp 860
> 872 66666666662e0f  data32 data32 data32 data32   85c 6a91  push $0xffffffffffffff91
> 879 1f840000000000  nopw %cs:0x0(%rax,%rax,1)     85e eb00  jmp 860
> 880 <common_interrupt>:                           860 <common_interrupt>:
>
> Sizes:
>
>    text    data     bss     dec     hex filename
>   12578       0       0   12578    3122 entry_64.o.before
>   12546       0       0   12546    3102 entry_64.o
>
> Signed-off-by: Denys Vlasenko
> CC: Linus Torvalds
> CC: Steven Rostedt
> CC: Ingo Molnar
> CC: Borislav Petkov
> CC: "H. Peter Anvin"
> CC: Andy Lutomirski
> CC: Oleg Nesterov
> CC: Frederic Weisbecker
> CC: Alexei Starovoitov
> CC: Will Drewry
> CC: Kees Cook
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/entry_64.S | 61 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 49 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index da137f9..2003417 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -537,33 +537,70 @@ END(ret_from_fork)
>   * Build the entry stubs and pointer table with some assembler magic.
>   * We pack 7 stubs into a single 32-byte chunk, which will fit in a
>   * single cache line on all modern x86 implementations.
> + * The last few cachelines pack 8 stubs each.
>   */
> +ALIGN_common_interrupt=32
>  	.section .init.rodata,"a"
>  ENTRY(interrupt)
>  	.section .entry.text
> -	.p2align 5
> -	.p2align CONFIG_X86_L1_CACHE_SHIFT
> +	.align 32
>  ENTRY(irq_entries_start)
>  	INTR_FRAME
>  vector=FIRST_EXTERNAL_VECTOR
> -.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7
> -	.balign 32
> +.rept 256 /* this number does not need to be exact, just big enough */
> +
> +	/*
> +	 * Block of six "push + short_jump" pairs, 4 bytes each,
> +	 * and a 2-byte seventh "push", without its jump yet.
> +	 */
> +need_near_jump=0
> +push_count=0
>  	.rept 7
>  	.if vector < FIRST_SYSTEM_VECTOR
> -	.if vector <> FIRST_EXTERNAL_VECTOR
> +1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
> +	.previous
> +	.quad 1b
> +	.section .entry.text
> +vector=vector+1
> +push_count=push_count+1
> +	.if push_count < 7
> +	/* Can we reach common_interrupt with a short jump? */
> +	.if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
> +	jmp common_interrupt /* yes */
> +	.else
> +	jmp 2f
> +need_near_jump=1
> +	.endif
>  	CFI_ADJUST_CFA_OFFSET -8
>  	.endif
> +	.endif
> +	.endr
> +
> +	/* If we are close to the end, we can pack in yet another pair */
> +	.if need_near_jump == 0
> +	.if push_count == 7
> +	/* The "short jump" for the seventh "push" */
> +	jmp common_interrupt
> +	CFI_ADJUST_CFA_OFFSET -8
> +	.endif
> +	.if vector < FIRST_SYSTEM_VECTOR
> +	/* "push + short_jump" pair #8 */
>  1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
> -	.if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
> -	jmp 2f
> -	.endif
> -	.previous
> +	.previous
>  	.quad 1b
> -	.section .entry.text
> +	.section .entry.text
>  vector=vector+1
> +push_count=push_count+1
> +	jmp common_interrupt
> +	CFI_ADJUST_CFA_OFFSET -8
>  	.endif
> -	.endr
> +	.else
> +	/* Use remaining space for "near jump" for the seventh "push" */
>  2:	jmp common_interrupt
> +	CFI_ADJUST_CFA_OFFSET -8
> +	.align 2
> +	.endif
> +
>  .endr
>  	CFI_ENDPROC
>  END(irq_entries_start)
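
A side note on the "$(~vector+0x80)" lines in the hunk above, since the
"always in signed byte range" comment is what keeps every push at 2
bytes: ~v + 0x80 is the same value as 0x7f - v, so across the external
vectors the pushed immediate runs from +0x5f down to -0x6f and always
fits in an imm8. The snippet below only illustrates that property; it is
another assemble-time throwaway with the v4.0-era vector bounds assumed,
not anything taken from the patch:

        FIRST_EXTERNAL_VECTOR = 0x20    /* assumed v4.0-era values */
        FIRST_SYSTEM_VECTOR   = 0xef

        /*
         * ~v + 0x80 == 0x7f - v: the largest value occurs at the first
         * external vector, the smallest at the last one; both must fit
         * in a signed byte for the assembler to pick the 2-byte
         * "push imm8" form.
         */
        .if (0x7f - FIRST_EXTERNAL_VECTOR) > 127
        .error "push immediate would not fit in a signed byte"
        .endif
        .if (0x7f - (FIRST_SYSTEM_VECTOR - 1)) < -128
        .error "push immediate would not fit in a signed byte"
        .endif

(If I read the common_interrupt side right, it simply undoes the 0x80
bias again before dispatching, so only the stub side has to care about
the encoding.)
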
So I think this might be easier to read if we went low tech and used an
open-coded, 240+ line assembly file for this, with a few obvious-purpose
helper macros? It's not like it will change all that often, so it only
has to be generated once.

Thanks,

	Ingo
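
For what it's worth, one possible shape of that "low tech" variant,
purely as a sketch: the macro name is made up, CFI annotations and the
kernel's ENTRY/END wrappers are left out, and the long per-vector list
is exactly the part that would be generated once:

                .macro  intr_stub vector
        1:      pushq   $(~\vector + 0x80)      /* same signed-byte trick as today */
                jmp     common_interrupt
                .section .init.rodata, "a"
                .quad   1b                      /* matching dispatch-table entry */
                .previous
                .endm

                .section .init.rodata, "a"
        interrupt:                              /* the pointer table */

                .section .entry.text, "ax"
                .align  32
        irq_entries_start:
                intr_stub 0x20
                intr_stub 0x21
                intr_stub 0x22
                /* ... one generated line per external vector, 0x20 through 0xee ... */

        common_interrupt:                       /* placeholder so the sketch assembles standalone */
                ret

Assembling that standalone with plain "as" and disassembling the object
should be enough to see the 4-byte stub packing without any .rept
machinery.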